Preface


The seeds for this book were first planted in 2001 when Steve Seitz at the University of Wash- 
ington invited me to co-teach a course called “Computer Vision for Computer Graphics”. At 
that time, computer vision techniques were increasingly being used in computer graphics to 
create image-based models of real-world objects, to create visual effects, and to merge real- 
world imagery using computational photography techniques. Our decision to focus on the 
applications of computer vision to fun problems such as image stitching and photo-based 3D 
modeling from personal photos seemed to resonate well with our students. 

Since that time, a similar syllabus and project-oriented course structure has been used to 
teach general computer vision courses both at the University of Washington and at Stanford. 
(The latter was a course I co-taught with David Fleet in 2003.) Similar curricula have been 
adopted at a number of other universities and also incorporated into more specialized courses 
on computational photography. (For ideas on how to use this book in your own course, please 
see Table 1.1 in Section 1.4.) 

This book also reflects my 20 years’ experience doing computer vision research in corpo- 
rate research labs, mostly at Digital Equipment Corporation’s Cambridge Research Lab and 
at Microsoft Research. In pursuing my work, I have mostly focused on problems and solu- 
tion techniques (algorithms) that have practical real-world applications and that work well in 
practice. Thus, this book has more emphasis on basic techniques that work under real-world 
conditions and less on more esoteric mathematics that has intrinsic elegance but less practical 
applicability. 

This book is suitable for teaching a senior-level undergraduate course in computer vision 
to students in both computer science and electrical engineering. I prefer students to have 
either an image processing or a computer graphics course as a prerequisite so that they can 
spend less time learning general background mathematics and more time studying computer 
vision techniques. The book is also suitable for teaching graduate-level courses in computer 
vision (by delving into the more demanding application and algorithmic areas) and as a gen- 
eral reference to fundamental techniques and the recent research literature. To this end, I have 
attempted wherever possible to at least cite the newest research in each sub-field, even if the
technical details are too complex to cover in the book itself.

In teaching our courses, we have found it useful for the students to attempt a number of 
small implementation projects, which often build on one another, in order to get them used to 
working with real-world images and the challenges that these present. The students are then 
asked to choose an individual topic for each of their small-group, final projects. (Sometimes 
these projects even turn into conference papers!) The exercises at the end of each chapter 
contain numerous suggestions for smaller mid-term projects, as well as more open-ended 
problems whose solutions are still active research topics. Wherever possible, I encourage 
students to try their algorithms on their own personal photographs, since this better motivates 
them, often leads to creative variants on the problems, and better acquaints them with the 
variety and complexity of real-world imagery. 

In formulating and solving computer vision problems, I have often found it useful to draw 
inspiration from three high-level approaches: 

• Scientific: build detailed models of the image formation process and develop mathe- 
matical techniques to invert these in order to recover the quantities of interest (where 
necessary, making simplifying assumptions to make the mathematics more tractable).

• Statistical: use probabilistic models to quantify the prior likelihood of your unknowns 
and the noisy measurement processes that produce the input images, then infer the best 
possible estimates of your desired quantities and analyze their resulting uncertainties. 
The inference algorithms used are often closely related to the optimization techniques 
used to invert the (scientific) image formation processes. 

• Engineering: develop techniques that are simple to describe and implement but that 
are also known to work well in practice. Test these techniques to understand their 
limitations and failure modes, as well as their expected computational costs (run-time
performance). 

These three approaches build on each other and are used throughout the book. 

My personal research and development philosophy (and hence the exercises in the book) 
have a strong emphasis on testing algorithms. It’s too easy in computer vision to develop an 
algorithm that does something plausible on a few images rather than something correct. The 
best way to validate your algorithms is to use a three-part strategy. 

First, test your algorithm on clean synthetic data, for which the exact results are known. 
Second, add noise to the data and evaluate how the performance degrades as a function of 
noise level. Finally, test the algorithm on real-world data, preferably drawn from a wide 
variety of sources, such as photos found on the Web. Only then can you truly know if your 
algorithm can deal with real-world complexity, i.e., images that do not fit some simplified 
model or assumptions. 
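
As a minimal sketch of the first two steps of this strategy, the following Python fragment tests a simple estimator on clean synthetic data and then measures how its error grows with the noise level. The line-fitting problem, the function names (`fit_line`, `slope_error`), and the chosen noise levels are illustrative assumptions and are not taken from the book; the third step, testing on real-world data, cannot be reduced to a few lines and is only indicated by a comment.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def slope_error(sigma, true_a=2.0, true_b=-1.0, n=200, trials=200, seed=0):
    """Mean absolute slope error over several trials at noise level sigma."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    xs = [i / (n - 1) for i in range(n)]
    total = 0.0
    for _ in range(trials):
        ys = [true_a * x + true_b + rng.gauss(0.0, sigma) for x in xs]
        a, _b = fit_line(xs, ys)
        total += abs(a - true_a)
    return total / trials

# Step 1: clean synthetic data, for which the exact answer is known.
clean_error = slope_error(sigma=0.0)

# Step 2: add noise and record how the error degrades with the noise level.
noise_levels = [0.01, 0.1, 1.0]
errors = [slope_error(sigma) for sigma in noise_levels]

# Step 3 (not shown): run the same estimator on real-world data drawn
# from a wide variety of sources.
```

Plotting `errors` against `noise_levels` gives the kind of degradation curve the second step asks for; an estimator whose clean error is not essentially zero has a bug that should be fixed before any noise is added.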


In order to help students in this process, this book comes with a large amount of supplementary
material, which can be found on the book’s Web site http://szeliski.org/Book. This
material, which is described in Appendix C, includes: 

• pointers to commonly used data sets for the problems, which can be found on the Web 

• pointers to software libraries, which can help students get started with basic tasks such 
as reading/writing images or creating and manipulating images 

• slide sets corresponding to the material covered in this book 

• a BibTeX bibliography of the papers cited in this book. 

The latter two resources may be of more interest to instructors and researchers publishing 
new papers in this field, but they will probably come in handy even with regular students. 
Some of the software libraries contain implementations of a wide variety of computer vision 
algorithms, which can enable you to tackle more ambitious projects (with your instructor’s 
consent).

Figure 1.1 The human visual system has no problem interpreting the subtle variations in
translucency and shading in this photograph and correctly segmenting the object from its
background.


Figure 1.2 Some examples of computer vision algorithms and applications. (a) Structure
from motion algorithms can reconstruct a sparse 3D point model of a large complex scene
from hundreds of partially overlapping photographs (Snavely, Seitz, and Szeliski 2006) ©
2006 ACM. (b) Stereo matching algorithms can build a detailed 3D model of a building facade
from hundreds of differently exposed photographs taken from the Internet (Goesele, Snavely,
Curless et al. 2007) © 2007 IEEE. (c) Person tracking algorithms can track a person walking
in front of a cluttered background (Sidenbladh, Black, and Fleet 2000) © 2000 Springer. (d)
Face detection algorithms, coupled with color-based clothing and hair detection algorithms,
can locate and recognize the individuals in this image (Sivic, Zitnick, and Szeliski 2006) ©
2006 Springer.

1.1 What is computer vision?

As humans, we perceive the three-dimensional structure of the world around us with apparent 
ease. Think of how vivid the three-dimensional percept is when you look at a vase of flowers 
sitting on the table next to you. You can tell the shape and translucency of each petal through 
the subtle patterns of light and shading that play across its surface and effortlessly segment 
each flower from the background of the scene (Figure 1.1). Looking at a framed group por- 
trait, you can easily count (and name) all of the people in the picture and even guess at their 
emotions from their facial appearance. Perceptual psychologists have spent decades trying to 
understand how the visual system works and, even though they can devise optical illusions 1 
to tease apart some of its principles (Figure 1.3), a complete solution to this puzzle remains 
elusive (Marr 1982; Palmer 1999; Livingstone 2008). 

Researchers in computer vision have been developing, in parallel, mathematical tech- 
niques for recovering the three-dimensional shape and appearance of objects in imagery. We 
now have reliable techniques for accurately computing a partial 3D model of an environment 
from thousands of partially overlapping photographs (Figure 1.2a). Given a large enough 
set of views of a particular object or façade, we can create accurate dense 3D surface models
using stereo matching (Figure 1.2b). We can track a person moving against a complex
background (Figure 1.2c). We can even, with moderate success, attempt to find and name 
all of the people in a photograph using a combination of face, clothing, and hair detection 
and recognition (Figure 1.2d). However, despite all of these advances, the dream of having a 
computer interpret an image at the same level as a two-year old (for example, counting all of 
the animals in a picture) remains elusive.

Why is vision so difficult? In part, it is because
vision is an inverse problem, in which we seek to recover some unknowns given insufficient
information to fully specify the solution. We must therefore resort to physics-based and prob- 
abilistic models to disambiguate between potential solutions. However, modeling the visual 
world in all of its rich complexity is far more difficult than, say, modeling the vocal tract that 
produces spoken sounds. 
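
One small illustration of why the information in an image is insufficient: under a pinhole camera model, every 3D point along the same viewing ray lands on the same image location, so depth cannot be recovered from a single projection without further assumptions. The sketch below assumes an idealized pinhole camera at the origin looking along the Z axis; the function name `project` and the sample points are hypothetical.

```python
def project(point, focal_length=1.0):
    """Pinhole projection of a 3D point (X, Y, Z) to image coordinates (x, y)."""
    X, Y, Z = point
    return (focal_length * X / Z, focal_length * Y / Z)

near_small = (1.0, 0.5, 2.0)
far_large = (4.0, 2.0, 8.0)  # same viewing ray, four times farther away

# Both points produce identical image coordinates, so the forward model
# cannot be inverted from this one measurement alone.
assert project(near_small) == project(far_large)
```

Resolving this kind of ambiguity is exactly where additional views, physics-based constraints, or probabilistic priors come in.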

The forward models that we use in computer vision are usually developed in physics (ra- 
diometry, optics, and sensor design) and in computer graphics. Both of these fields model 
how objects move and animate, how light reflects off their surfaces, is scattered by the at- 
mosphere, refracted through camera lenses (or human eyes), and finally projected onto a flat 
(or curved) image plane. While computer graphics are not yet perfect (no fully computer- 
animated movie with human characters has yet succeeded at crossing the uncanny valley 2 
that separates real humans from android robots and computer-animated humans), in limited 

1 http://www.michaelbach.de/ot/sze_muelue 

2 The term uncanny valley was originally coined by roboticist Masahiro Mori as applied to robotics (Mori 1970). 
It is also commonly applied to computer-animated films such as Final Fantasy and Polar Express (Geller 2008).

Figure 1.3 Some common optical illusions and what they might tell us about the visual system:
(a) The classic Müller-Lyer illusion, where the lengths of the two horizontal lines appear
different, probably due to the imagined perspective effects. (b) The “white” square B in the
shadow and the “black” square A in the light actually have the same absolute intensity value.
The percept is due to brightness constancy, the visual system’s attempt to discount illumination
when interpreting colors. Image courtesy of Ted Adelson,
http://web.mit.edu/persci/people/adelson/checkershadow_illusion.html. (c) A variation of the
Hermann grid illusion, courtesy of Hany Farid,
http://www.cs.dartmouth.edu/~farid/illusions/hermann.html. As you
move your eyes over the figure, gray spots appear at the intersections. (d) Count the red As
in the left half of the figure. Now count them in the right half. Is it significantly harder?
The explanation has to do with a pop-out effect (Treisman 1985), which tells us about the
operations of parallel perception and integration pathways in the brain.

domains, such as rendering a still scene composed of everyday objects or animating extinct
creatures such as dinosaurs, the illusion of reality is perfect. 

In computer vision, we are trying to do the inverse, i.e., to describe the world that we see 
in one or more images and to reconstruct its properties, such as shape, illumination, and color 
distributions. It is amazing that humans and animals do this so effortlessly, while computer 
vision algorithms are so error prone. People who have not worked in the field often under- 
estimate the difficulty of the problem. (Colleagues at work often ask me for software to find 
and name all the people in photos, so they can get on with the more “interesting” work.) This 
misperception that vision should be easy dates back to the early days of artificial intelligence 
(see Section 1.2), when it was initially believed that the cognitive (logic proving and plan- 
ning) parts of intelligence were intrinsically more difficult than the perceptual components 
(Boden 2006). 

The good news is that computer vision is being used today in a wide variety of real-world 
applications, which include: 

• Optical character recognition (OCR): reading handwritten postal codes on letters 
(Figure 1.4a) and automatic number plate recognition (ANPR); 

• Machine inspection: rapid parts inspection for quality assurance using stereo vision 
with specialized illumination to measure tolerances on aircraft wings or auto body parts 
(Figure 1.4b) or looking for defects in steel castings using X-ray vision; 

• Retail: object recognition for automated checkout lanes (Figure 1.4c); 

• 3D model building (photogrammetry): fully automated construction of 3D models 
from aerial photographs used in systems such as Bing Maps; 

• Medical imaging: registering pre-operative and intra-operative imagery (Figure 1.4d)
or performing long-term studies of people’s brain morphology as they age; 

• Automotive safety: detecting unexpected obstacles such as pedestrians on the street, 
under conditions where active vision techniques such as radar or lidar do not work 
well (Figure 1.4e; see also Miller, Campbell, Huttenlocher et al. (2008); Montemerlo, 
Becker, Bhat et al. (2008); Urmson, Anhalt, Bagnell et al. (2008) for examples of fully 
automated driving); 

• Match move: merging computer-generated imagery (CGI) with live action footage by 
tracking feature points in the source video to estimate the 3D camera motion and shape 
of the environment. Such techniques are widely used in Hollywood (e.g., in movies 
such as Jurassic Park) (Roble 1999; Roble and Zafar 2009); they also require the use of 
precise matting to insert new elements between foreground and background elements 
(Chuang, Agarwala, Curless et al. 2002).

Figure 1.4 Some industrial applications of computer vision: (a) optical character recognition
(OCR) http://yann.lecun.com/exdb/lenet/; (b) mechanical inspection http://www.cognitens.com/;
(c) retail http://www.evoretail.com/; (d) medical imaging http://www.clarontech.com/;
(e) automotive safety http://www.mobileye.com/; (f) surveillance and traffic monitoring
http://www.honeywellvideo.com/, courtesy of Honeywell International Inc.









• Motion capture (mocap): using retro-reflective markers viewed from multiple cam- 
eras or other vision-based techniques to capture actors for computer animation; 

• Surveillance: monitoring for intruders, analyzing highway traffic (Figure 1.4f), and 
monitoring pools for drowning victims; 

• Fingerprint recognition and biometrics: for automatic access authentication as well 
as forensic applications. 

David Lowe’s Web site of industrial vision applications (http://www.cs.ubc.ca/spider/lowe/ 
vision.html) lists many other interesting industrial applications of computer vision. While the 
above applications are all extremely important, they mostly pertain to fairly specialized kinds 
of imagery and narrow domains. 

In this book, we focus more on broader consumer-level applications, such as fun things 
you can do with your own personal photographs and video. These include: 

• Stitching: turning overlapping photos into a single seamlessly stitched panorama (Fig- 
ure 1.5a), as described in Chapter 9; 

• Exposure bracketing: merging multiple exposures taken under challenging lighting 
conditions (strong sunlight and shadows) into a single perfectly exposed image (Fig- 
ure 1.5b), as described in Section 10.2; 

• Morphing: turning a picture of one of your friends into another, using a seamless 
morph transition (Figure 1.5c); 

• 3D modeling: converting one or more snapshots into a 3D model of the object or 
person you are photographing (Figure 1.5d), as described in Section 12.6;

• Video match move and stabilization: inserting 2D pictures or 3D models into your 
videos by automatically tracking nearby reference points (see Section 7.4.2) 3 or using 
motion estimates to remove shake from your videos (see Section 8.2.1); 

• Photo-based walkthroughs: navigating a large collection of photographs, such as the 
interior of your house, by flying between different photos in 3D (see Sections 13.1.2 
and 13.5.5);

• Face detection: for improved camera focusing as well as more relevant image search- 
ing (see Section 14.1.1); 

• Visual authentication: automatically logging family members onto your home com- 
puter as they sit down in front of the webcam (see Section 14.2). 




Computer Vision: Algorithms and Applications (September 3, 2010 draft) 




Figure 1.5 Some consumer applications of computer vision: (a) image stitching: merging
different views (Szeliski and Shum 1997) © 1997 ACM; (b) exposure bracketing: merging
different exposures; (c) morphing: blending between two photographs (Gomes, Darsa, Costa
et al. 1999) © 1999 Morgan Kaufmann; (d) turning a collection of photographs into a 3D
model (Sinha, Steedly, Szeliski et al. 2008) © 2008 ACM.

The great thing about these applications is that they are already familiar to most students;
they are, at least, technologies that students can immediately appreciate and use with their 
own personal media. Since computer vision is a challenging topic, given the wide range 
of mathematics being covered 4 and the intrinsically difficult nature of the problems being
solved, having fun and relevant problems to work on can be highly motivating and inspiring. 

The other major reason why this book has a strong focus on applications is that they can 
be used to formulate and constrain the potentially open-ended problems endemic in vision. 
For example, if someone comes to me and asks for a good edge detector, my first question is 
usually to ask why? What kind of problem are they trying to solve and why do they believe
that edge detection is an important component? If they are trying to locate faces, I usually 
point out that most successful face detectors use a combination of skin color detection (Exercise 2.8)
and simple blob features (Section 14.1.1); they do not rely on edge detection. If they
are trying to match door and window edges in a building for the purpose of 3D reconstruction, 
I tell them that edges are a fine idea but it is better to tune the edge detector for long edges 
(see Sections 3.2.3 and 4.2) and link them together into straight lines with common vanishing 
points before matching (see Section 4.3). 

Thus, it is better to think back from the problem at hand to suitable techniques, rather 
than to grab the first technique that you may have heard of. This kind of working back from 
problems to solutions is typical of an engineering approach to the study of vision and reflects 
my own background in the field. First, I come up with a detailed problem definition and 
decide on the constraints and specifications for the problem. Then, I try to find out which 
techniques are known to work, implement a few of these, evaluate their performance, and 
finally make a selection. In order for this process to work, it is important to have realistic test 
data, both synthetic, which can be used to verify correctness and analyze noise sensitivity, 
and real-world data typical of the way the system will finally be used. 

However, this book is not just an engineering text (a source of recipes). It also takes a 
scientific approach to basic vision problems. Here, I try to come up with the best possible 
models of the physics of the system at hand: how the scene is created, how light interacts 
with the scene and atmospheric effects, and how the sensors work, including sources of noise 
and uncertainty. The task is then to try to invert the acquisition process to come up with the 
best possible description of the scene. 

The book often uses a statistical approach to formulating and solving computer vision 
problems. Where appropriate, probability distributions are used to model the scene and the 
noisy image acquisition process. The association of prior distributions with unknowns is often 

3 For a fun student project on this topic, see the “PhotoBook” project at http://www.cc.gatech.edu/dvfx/videos/ 
dvfx2005.html. 

4 These techniques include physics, Euclidean and projective geometry, statistics, and optimization. They make
computer vision a fascinating field to study and a great way to learn techniques widely applicable in other fields.

called Bayesian modeling (Appendix B). It is possible to associate a risk or loss function with
mis-estimating the answer (Section B.2) and to set up your inference algorithm to minimize 
the expected risk. (Consider a robot trying to estimate the distance to an obstacle: it is 
usually safer to underestimate than to overestimate.) With statistical techniques, it often helps 
to gather lots of training data from which to learn probabilistic models. Finally, statistical 
approaches enable you to use proven inference techniques to estimate the best answer (or 
distribution of answers) and to quantify the uncertainty in the resulting estimates. 
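
The robot example can be made concrete with a small sketch. Assume, purely for illustration, a discrete posterior over the distance to an obstacle and a linear loss that penalizes overestimates five times more heavily than underestimates; the numbers, the penalty ratio, and the names `expected_loss` and `min_risk_estimate` are all hypothetical. Minimizing the expected loss then yields an estimate below the posterior mean, i.e., a deliberately cautious answer.

```python
def expected_loss(estimate, hypotheses, over_penalty=5.0):
    """Expected loss of an estimate under a discrete posterior.

    hypotheses: list of (distance, probability) pairs.
    Overestimating the distance costs over_penalty times more per meter
    than underestimating it (an assumed, illustrative loss function).
    """
    total = 0.0
    for distance, prob in hypotheses:
        error = estimate - distance
        cost = over_penalty * error if error > 0 else -error
        total += prob * cost
    return total

# Hypothetical posterior over the distance to an obstacle, in meters.
posterior = [(1.8, 0.1), (2.0, 0.4), (2.2, 0.4), (2.4, 0.1)]

posterior_mean = sum(d * p for d, p in posterior)

# Brute-force search over candidate estimates from 1.50 m to 2.50 m.
candidates = [1.5 + 0.01 * i for i in range(101)]
min_risk_estimate = min(candidates, key=lambda e: expected_loss(e, posterior))
```

With a symmetric loss the same search would settle near the middle of the distribution; the asymmetry is what pulls the estimate toward the shorter, safer distance.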

Because so much of computer vision involves the solution of inverse problems or the esti- 
mation of unknown quantities, my book also has a heavy emphasis on algorithms, especially 
those that are known to work well in practice. For many vision problems, it is all too easy to 
come up with a mathematical description of the problem that either does not match realistic 
real-world conditions or does not lend itself to the stable estimation of the unknowns. What 
we need are algorithms that are both robust to noise and deviation from our models and rea- 
sonably efficient in terms of run-time resources and space. In this book, I go into these issues 
in detail, using Bayesian techniques, where applicable, to ensure robustness, and efficient 
search, minimization, and linear system solving algorithms to ensure efficiency. Most of the 
algorithms described in this book are at a high level, being mostly a list of steps that have to 
be filled in by students or by reading more detailed descriptions elsewhere. In fact, many of 
the algorithms are sketched out in the exercises. 

Now that I’ve described the goals of this book and the frameworks that I use, I devote the 
rest of this chapter to two additional topics. Section 1.2 is a brief synopsis of the history of 
computer vision. It can easily be skipped by those who want to get to “the meat” of the new 
material in this book and do not care as much about who invented what when. 

The second is an overview of the book’s contents, Section 1.3, which is useful reading for 
everyone who intends to make a study of this topic (or to jump in partway, since it describes 
chapter inter-dependencies). This outline is also useful for instructors looking to structure 
one or more courses around this topic, as it provides sample curricula based on the book’s 
contents. 


1.2 A brief history 

In this section, I provide a brief personal synopsis of the main developments in computer 
vision over the last 30 years (Figure 1.6); at least, those that I find personally interesting 
and which appear to have stood the test of time. Readers not interested in the provenance 
of various ideas and the evolution of this field should skip ahead to the book overview in 
Section 1.3.

Figure 1.6 A rough timeline of some of the most active topics of research in computer vision.


1970s. When computer vision first started out in the early 1970s, it was viewed as the 
visual perception component of an ambitious agenda to mimic human intelligence and to 
endow robots with intelligent behavior. At the time, it was believed by some of the early 
pioneers of artificial intelligence and robotics (at places such as MIT, Stanford, and CMU) 
that solving the “visual input” problem would be an easy step along the path to solving more 
difficult problems such as higher-level reasoning and planning. According to one well-known 
story, in 1966, Marvin Minsky at MIT asked his undergraduate student Gerald Jay Sussman 
to “spend the summer linking a camera to a computer and getting the computer to describe 
what it saw” (Boden 2006, p. 781). 5 We now know that the problem is slightly more difficult 
than that. 6 

What distinguished computer vision from the already existing field of digital image pro- 
cessing (Rosenfeld and Pfaltz 1966; Rosenfeld and Kak 1976) was a desire to recover the 
three-dimensional structure of the world from images and to use this as a stepping stone to- 
wards full scene understanding. Winston (1975) and Hanson and Riseman (1978) provide 
two nice collections of classic papers from this early period. 

Early attempts at scene understanding involved extracting edges and then inferring the 
3D structure of an object or a “blocks world” from the topological structure of the 2D lines
(Roberts 1965). Several line labeling algorithms (Figure 1.7a) were developed at that time 
(Huffman 1971; Clowes 1971; Waltz 1975; Rosenfeld, Hummel, and Zucker 1976; Kanade 
1980). Nalwa (1993) gives a nice review of this area. The topic of edge detection was also 

5 Boden (2006) cites (Crevier 1993) as the original source. The actual Vision Memo was authored by Seymour 
Papert (1966) and involved a whole cohort of students. 

6 To see how far robotic vision has come in the last four decades, have a look at the towel-folding robot at 
http://rll.eecs.berkeley.edu/pr/icra10/ (Maitin-Shepard, Cusumano-Towner, Lei et al. 2010).








Figure 1.7 Some early (1970s) examples of computer vision algorithms: (a) line labeling
(Nalwa 1993) © 1993 Addison-Wesley, (b) pictorial structures (Fischler and Elschlager
1973) © 1973 IEEE, (c) articulated body model (Marr 1982) © 1982 David Marr, (d) intrin- 
sic images (Barrow and Tenenbaum 1981) © 1973 IEEE, (e) stereo correspondence (Marr 
1982) © 1982 David Marr, (f) optical flow (Nagel and Enkelmann 1986) © 1986 IEEE. 


an active area of research; a nice survey of contemporaneous work can be found in (Davis 
1975). 

Three-dimensional modeling of non-polyhedral objects was also being studied (Baum- 
gart 1974; Baker 1977). One popular approach used generalized cylinders, i.e., solids of 
revolution and swept closed curves (Agin and Binford 1976; Nevatia and Binford 1977), often
arranged into parts relationships 7 (Hinton 1977; Marr 1982) (Figure 1.7c). Fischler and
Elschlager (1973) called such elastic arrangements of parts pictorial structures (Figure 1.7b). 
This is currently one of the favored approaches being used in object recognition (see Sec- 
tion 14.4 and Felzenszwalb and Huttenlocher 2005). 

A qualitative approach to understanding intensities and shading variations and explaining 
them by the effects of image formation phenomena, such as surface orientation and shadows, 
was championed by Barrow and Tenenbaum (1981) in their paper on intrinsic images (Figure 1.7d),
along with the related 2½-D sketch ideas of Marr (1982). This approach is again
seeing a bit of a revival in the work of Tappen, Freeman, and Adelson (2005). 

More quantitative approaches to computer vision were also developed at the time, in- 
cluding the first of many feature-based stereo correspondence algorithms (Figure 1.7e) (Dev 

7 In robotics and computer animation, these linked-part graphs are often called kinematic chains.

1974; Marr and Poggio 1976; Moravec 1977; Marr and Poggio 1979; Mayhew and Frisby
1981; Baker 1982; Barnard and Fischler 1982; Ohta and Kanade 1985; Grimson 1985; Pol- 
lard, Mayhew, and Frisby 1985; Prazdny 1985) and intensity-based optical flow algorithms 
(Figure 1.7f) (Horn and Schunck 1981; Huang 1981; Lucas and Kanade 1981; Nagel 1986). 
The early work in simultaneously recovering 3D structure and camera motion (see Chapter 7) 
also began around this time (Ullman 1979; Longuet-Higgins 1981). 

A lot of the philosophy of how vision was believed to work at the time is summarized 
in David Marr’s (1982) book. 8 In particular, Marr introduced his notion of the three levels
of description of a (visual) information processing system. These three levels, very loosely 
paraphrased according to my own interpretation, are: 

• Computational theory: What is the goal of the computation (task) and what are the 
constraints that are known or can be brought to bear on the problem? 

• Representations and algorithms: How are the input, output, and intermediate infor- 
mation represented and which algorithms are used to calculate the desired result? 

• Hardware implementation: How are the representations and algorithms mapped onto 
actual hardware, e.g., a biological vision system or a specialized piece of silicon? Con- 
versely, how can hardware constraints be used to guide the choice of representation 
and algorithm? With the increasing use of graphics chips (GPUs) and many-core ar- 
chitectures for computer vision (see Section C.2), this question is again becoming quite 
relevant. 

As I mentioned earlier in this introduction, it is my conviction that a careful analysis of the 
problem specification and known constraints from image formation and priors (the scientific 
and statistical approaches) must be married with efficient and robust algorithms (the engineer- 
ing approach) to design successful vision algorithms. Thus, it seems that Marr’s philosophy 
is as good a guide to framing and solving problems in our field today as it was 25 years ago. 

1980s. In the 1980s, a lot of attention was focused on more sophisticated mathematical 
techniques for performing quantitative image and scene analysis. 

Image pyramids (see Section 3.5) started being widely used to perform tasks such as im- 
age blending (Figure 1.8a) and coarse-to-fine correspondence search (Rosenfeld 1980; Burt 
and Adelson 1983a,b; Rosenfeld 1984; Quam 1984; Anandan 1989). Continuous versions 
of pyramids using the concept of scale-space processing were also developed (Witkin 1983; 
Witkin, Terzopoulos, and Kass 1986; Lindeberg 1990). In the late 1980s, wavelets (see Sec- 
tion 3.5.4) started displacing or augmenting regular image pyramids in some applications 


8 More recent developments in visual perception theory are covered in (Palmer 1999; Livingstone 2008).

Figure 1.8 Examples of computer vision algorithms from the 1980s: (a) pyramid blending
(Burt and Adelson 1983b) © 1983 ACM, (b) shape from shading (Freeman and Adelson
1991) © 1991 IEEE, (c) edge detection (Freeman and Adelson 1991) © 1991 IEEE, (d)
physically based models (Terzopoulos and Witkin 1988) © 1988 IEEE, (e) regularization-
based surface reconstruction (Terzopoulos 1988) © 1988 IEEE, (f) range data acquisition
and merging (Banno, Masuda, Oishi et al. 2008) © 2008 Springer.


(Adelson, Simoncelli, and Hingorani 1987; Mallat 1989; Simoncelli and Adelson 1990a, b; 
Simoncelli, Freeman, Adelson et al. 1992). 

The use of stereo as a quantitative shape cue was extended by a wide variety of shape- 
from-X techniques, including shape from shading (Figure 1.8b) (see Section 12.1.1 and Horn 
1975; Pentland 1984; Blake, Zimmerman, and Knowles 1985; Horn and Brooks 1986, 1989), 
photometric stereo (see Section 12.1.1 and Woodham 1981), shape from texture (see Sec- 
tion 12.1.2 and Witkin 1981; Pentland 1984; Malik and Rosenholtz 1997), and shape from 
focus (see Section 12.1.3 and Nayar, Watanabe, and Noguchi 1995). Horn (1986) has a nice 
discussion of most of these techniques. 

Research into better edge and contour detection (Figure 1.8c) (see Section 4.2) was also 
active during this period (Canny 1986; Nalwa and Binford 1986), including the introduc- 
tion of dynamically evolving contour trackers (Section 5.1.1) such as snakes (Kass, Witkin, 
and Terzopoulos 1988), as well as three-dimensional physically based models (Figure 1.8d) 
(Terzopoulos, Witkin, and Kass 1987; Kass, Witkin, and Terzopoulos 1988; Terzopoulos and 
Fleischer 1988; Terzopoulos, Witkin, and Kass 1988). 

Researchers noticed that a lot of the stereo, flow, shape-from-X, and edge detection al- 
gorithms could be unified, or at least described, using the same mathematical framework if 
they were posed as variational optimization problems (see Section 3.7) and made more ro- 
bust (well-posed) using regularization (Figure 1.8e) (see Section 3.7.1 and Terzopoulos 1983; 
Poggio, Torre, and Koch 1985; Terzopoulos 1986b; Blake and Zisserman 1987; Bertero, Pog- 
gio, and Torre 1988; Terzopoulos 1988). Around the same time, Geman and Geman (1984) 
pointed out that such problems could equally well be formulated using discrete Markov Ran- 
dom Field (MRF) models (see Section 3.7.2), which enabled the use of better (global) search 
and optimization algorithms, such as simulated annealing. 
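
To make the idea of regularization concrete, here is a minimal sketch (not taken from any of the cited papers) that smooths a one-dimensional signal. It assumes a quadratic data term and a quadratic first-order smoothness term, E(f) = sum_i (f_i - d_i)^2 + lam * sum_i (f_{i+1} - f_i)^2, whose minimizer is the solution of a linear system. The function name `regularize_1d` and the parameter `lam` are illustrative choices, not notation from this book.

```python
import numpy as np

def regularize_1d(d, lam):
    """Minimize sum (f - d)^2 + lam * sum (f[i+1] - f[i])^2 over f."""
    d = np.asarray(d, dtype=float)
    n = d.size
    # First-difference operator D of shape (n-1, n): (D f)[i] = f[i+1] - f[i].
    D = np.diff(np.eye(n), axis=0)
    # Setting the gradient of the energy to zero gives (I + lam * D^T D) f = d.
    A = np.eye(n) + lam * (D.T @ D)
    return np.linalg.solve(A, d)

# Hypothetical noisy measurements; a larger lam yields a smoother estimate.
noisy = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
smooth = regularize_1d(noisy, lam=10.0)
```

With `lam = 0` the data are returned unchanged; as `lam` grows, the smoothness term dominates and the result approaches a constant signal.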

Online variants of MRF algorithms that modeled and updated uncertainties using the 
Kalman filter were introduced a little later (Dickmanns and Graefe 1988; Matthies, Kanade, 
and Szeliski 1989; Szeliski 1989). Attempts were also made to map both regularized and 
MRF algorithms onto parallel hardware (Poggio and Koch 1985; Poggio, Little, Gamble 
et al. 1988; Fischler, Firschein, Barnard et al. 1989). The book by Fischler and Firschein 
(1987) contains a nice collection of articles focusing on all of these topics (stereo, flow, 
regularization, MRFs, and even higher-level vision). 

Three-dimensional range data processing (acquisition, merging, modeling, and recogni- 
tion; see Figure 1.8f) continued being actively explored during this decade (Agin and Binford 
1976; Besl and Jain 1985; Faugeras and Hebert 1987; Curless and Levoy 1996). The compi- 
lation by Kanade (1987) contains a lot of the interesting papers in this area. 

1990s. While a lot of the previously mentioned topics continued to be explored, a few of 
them became significantly more active. 

A burst of activity in using projective invariants for recognition (Mundy and Zisserman 
1992) evolved into a concerted effort to solve the structure from motion problem (see Chap- 
ter 7). A lot of the initial activity was directed at projective reconstructions, which did not 
require knowledge of camera calibration (Faugeras 1992; Hartley, Gupta, and Chang 1992; 
Hartley 1994a; Faugeras and Luong 2001; Hartley and Zisserman 2004). Simultaneously, fac- 
torization techniques (Section 7.3) were developed to efficiently solve problems for which or- 
thographic camera approximations were applicable (Figure 1.9a) (Tomasi and Kanade 1992; 
Poelman and Kanade 1997; Anandan and Irani 2002) and then later extended to the perspec- 
tive case (Christy and Horaud 1996; Triggs 1996). Eventually, the field started using full 
global optimization (see Section 7.4 and Taylor, Kriegman, and Anandan 1991; Szeliski and 
Kang 1994; Azarbayejani and Pentland 1995), which was later recognized as being the same 
as the bundle adjustment techniques traditionally used in photogrammetry (Triggs, McLauch- 
lan, Hartley et al. 1999). Fully automated (sparse) 3D modeling systems were built using such 
techniques (Beardsley, Torr, and Zisserman 1996; Schaffalitzky and Zisserman 2002; Brown 
and Lowe 2003; Snavely, Seitz, and Szeliski 2006). 

Work begun in the 1980s on using detailed measurements of color and intensity combined 
with accurate physical models of radiance transport and color image formation created its own 
subfield known as physics-based vision. A good survey of the field can be found in the three- 
volume collection on this topic (Wolff, Shafer, and Healey 1992a; Healey and Shafer 1992; 
Shafer, Healey, and Wolff 1992). 

Figure 1.9 Examples of computer vision algorithms from the 1990s: (a) factorization-based 
structure from motion (Tomasi and Kanade 1992) © 1992 Springer, (b) dense stereo match- 
ing (Boykov, Veksler, and Zabih 2001), (c) multi-view reconstruction (Seitz and Dyer 1999) 
© 1999 Springer, (d) face tracking (Matthews, Xiao, and Baker 2007), (e) image segmenta- 
tion (Belongie, Fowlkes, Chung et al. 2002) © 2002 Springer, (f) face recognition (Turk and 
Pentland 1991a). 

Optical flow methods (see Chapter 8) continued to be improved (Nagel and Enkelmann 
1986; Bolles, Baker, and Marimont 1987; Horn and Weldon Jr. 1988; Anandan 1989; Bergen, 
Anandan, Hanna et al. 1992; Black and Anandan 1996; Bruhn, Weickert, and Schnorr 2005; 
Papenberg, Bruhn, Brox et al. 2006), with (Nagel 1986; Barron, Fleet, and Beauchemin 1994; 
Baker, Black, Lewis et al. 2007) being good surveys. Similarly, a lot of progress was made 
on dense stereo correspondence algorithms (see Chapter 11, Okutomi and Kanade (1993, 
1994); Boykov, Veksler, and Zabih (1998); Birchfield and Tomasi (1999); Boykov, Veksler, 
and Zabih (2001), and the survey and comparison in Scharstein and Szeliski (2002)), with 
the biggest breakthrough being perhaps global optimization using graph cut techniques (Fig- 
ure 1.9b) (Boykov, Veksler, and Zabih 2001). 

Multi-view stereo algorithms (Figure 1.9c) that produce complete 3D surfaces (see Sec- 
tion 11.6) were also an active topic of research (Seitz and Dyer 1999; Kutulakos and Seitz 
2000) that continues to be active today (Seitz, Curless, Diebel et al. 2006). Techniques for 
producing 3D volumetric descriptions from binary silhouettes (see Section 11.6.2) continued 
to be developed (Potmesil 1987; Srivasan, Liang, and Hackwood 1990; Szeliski 1993; Lau- 
rentini 1994), along with techniques based on tracking and reconstructing smooth occluding 
contours (see Section 11.2.1 and Cipolla and Blake 1992; Vaillant and Faugeras 1992; Zheng 
1994; Boyer and Berger 1997; Szeliski and Weiss 1998; Cipolla and Giblin 2000). 

Tracking algorithms also improved a lot, including contour tracking using active contours 
(see Section 5.1), such as snakes (Kass, Witkin, and Terzopoulos 1988), particle filters (Blake 
and Isard 1998), and level sets (Malladi, Sethian, and Vemuri 1995), as well as intensity-based 
(direct) techniques (Lucas and Kanade 1981; Shi and Tomasi 1994; Rehg and Kanade 1994), 
often applied to tracking faces (Figure 1.9d) (Lanitis, Taylor, and Cootes 1997; Matthews and 
Baker 2004; Matthews, Xiao, and Baker 2007) and whole bodies (Sidenbladh, Black, and 
Fleet 2000; Hilton, Fua, and Ronfard 2006; Moeslund, Hilton, and Kruger 2006). 

Image segmentation (see Chapter 5) (Figure 1.9e), a topic which has been active since 
the earliest days of computer vision (Brice and Fennema 1970; Horowitz and Pavlidis 1976; 
Riseman and Arbib 1977; Rosenfeld and Davis 1979; Haralick and Shapiro 1985; Pavlidis 
and Liow 1990), was also an active topic of research, producing techniques based on min- 
imum energy (Mumford and Shah 1989) and minimum description length (Leclerc 1989), 
normalized cuts (Shi and Malik 2000), and mean shift (Comaniciu and Meer 2002). 

Statistical learning techniques started appearing, first in the application of principal com- 
ponent eigenface analysis to face recognition (Figure 1.9f) (see Section 14.2.1 and Turk and 
Pentland 1991a) and linear dynamical systems for curve tracking (see Section 5.1.1 and Blake 
and Isard 1998). 

Perhaps the most notable development in computer vision during this decade was the 
increased interaction with computer graphics (Seitz and Szeliski 1999), especially in the 
cross-disciplinary area of image-based modeling and rendering (see Chapter 13). The idea of 
manipulating real-world imagery directly to create new animations first came to prominence 
with image morphing techniques (Figure 1.5c) (see Section 3.6.3 and Beier and Neely 1992) 
and was later applied to view interpolation (Chen and Williams 1993; Seitz and Dyer 1996), 
panoramic image stitching (Figure 1.5a) (see Chapter 9 and Mann and Picard 1994; Chen 
1995; Szeliski 1996; Szeliski and Shum 1997; Szeliski 2006a), and full light-field rendering 
(Figure 1.10a) (see Section 13.3 and Gortler, Grzeszczuk, Szeliski et al. 1996; Levoy and 
Hanrahan 1996; Shade, Gortler, He et al. 1998). At the same time, image-based modeling 
techniques (Figure 1.10b) for automatically creating realistic 3D models from collections of 
images were also being introduced (Beardsley, Torr, and Zisserman 1996; Debevec, Taylor, 
and Malik 1996; Taylor, Debevec, and Malik 1996). 


Figure 1.10 Recent examples of computer vision algorithms: (a) image-based rendering 
(Gortler, Grzeszczuk, Szeliski et al. 1996), (b) image-based modeling (Debevec, Taylor, and 
Malik 1996) © 1996 ACM, (c) interactive tone mapping (Lischinski, Farbman, Uyttendaele 
et al. 2006a), (d) texture synthesis (Efros and Freeman 2001), (e) feature-based recognition 
(Fergus, Perona, and Zisserman 2007), (f) region-based recognition (Mori, Ren, Efros et al. 
2004) © 2004 IEEE. 


2000s. This past decade has continued to see a deepening interplay between the vision and 
graphics fields. In particular, many of the topics introduced under the rubric of image-based 
rendering, such as image stitching (see Chapter 9), light-field capture and rendering (see 
Section 13.3), and high dynamic range (HDR) image capture through exposure bracketing 
(Figure 1.5b) (see Section 10.2 and Mann and Picard 1995; Debevec and Malik 1997), were 
re-christened as computational photography (see Chapter 10) to acknowledge the increased 
use of such techniques in everyday digital photography. For example, the rapid adoption of 
exposure bracketing to create high dynamic range images necessitated the development of 
tone mapping algorithms (Figure 1.10c) (see Section 10.2.1) to convert such images back 
to displayable results (Fattal, Lischinski, and Werman 2002; Durand and Dorsey 2002; Rein- 
hard, Stark, Shirley et al. 2002; Lischinski, Farbman, Uyttendaele et al. 2006a). In addition to 
merging multiple exposures, techniques were developed to merge flash images with non-flash 
counterparts (Eisemann and Durand 2004; Petschnigg, Agrawala, Hoppe et al. 2004) and to 
interactively or automatically select different regions from overlapping images (Agarwala, 
Dontcheva, Agrawala et al. 2004). 

Texture synthesis (Figure 1.10d) (see Section 10.5), quilting (Efros and Leung 1999; Efros 
and Freeman 2001; Kwatra, Schodl, Essa et al. 2003) and inpainting (Bertalmio, Sapiro, 
Caselles et al. 2000; Bertalmio, Vese, Sapiro et al. 2003; Criminisi, Perez, and Toyama 2004) 
are additional topics that can be classified as computational photography techniques, since 
they re-combine input image samples to produce new photographs. 

A second notable trend during this past decade has been the emergence of feature-based 
techniques (combined with learning) for object recognition (see Section 14.3 and Ponce, 
Hebert, Schmid et al. 2006). Some of the notable papers in this area include the constellation 
model of Fergus, Perona, and Zisserman (2007) (Figure 1.10e) and the pictorial structures 
of Felzenszwalb and Huttenlocher (2005). Feature-based techniques also dominate other 
recognition tasks, such as scene recognition (Zhang, Marszalek, Lazebnik et al. 2007) and 
panorama and location recognition (Brown and Lowe 2007; Schindler, Brown, and Szeliski 
2007). And while interest point (patch-based) features tend to dominate current research, 
some groups are pursuing recognition based on contours (Belongie, Malik, and Puzicha 2002) 
and region segmentation (Figure 1.10f) (Mori, Ren, Efros et al. 2004). 

Another significant trend from this past decade has been the development of more efficient 
algorithms for complex global optimization problems (see Sections 3.7 and B.5 and Szeliski, 
Zabih, Scharstein et al. 2008; Blake, Kohli, and Rother 2010). While this trend began with 
work on graph cuts (Boykov, Veksler, and Zabih 2001; Kohli and Torr 2007), a lot of progress 
has also been made in message passing algorithms, such as loopy belief propagation (LBP) 
(Yedidia, Freeman, and Weiss 2001; Kumar and Torr 2006). 

The final trend, which now dominates a lot of the visual recognition research in our com- 
munity, is the application of sophisticated machine learning techniques to computer vision 
problems (see Section 14.5.1 and Freeman, Perona, and Scholkopf 2008). This trend coin- 
cides with the increased availability of immense quantities of partially labelled data on the 
Internet, which makes it more feasible to learn object categories without the use of careful 
human supervision. 


1.3 Book overview 

In the final part of this introduction, I give a brief tour of the material in this book, as well 
as a few notes on notation and some additional general references. Since computer vision is 
such a broad field, it is possible to study certain aspects of it, e.g., geometric image formation 
and 3D structure recovery, without engaging other parts, e.g., the modeling of reflectance and 
shading. Some of the chapters in this book are only loosely coupled with others, and it is not 
strictly necessary to read all of the material in sequence. 




Computer Vision: Algorithms and Applications (September 3, 2010 draft) 



[Figure 1.11 diagram: Images (2D), Geometry (3D, shape), and Photometry (appearance), linked by image processing, vision, and graphics.] 

Figure 1.11 Relationship between images, geometry, and photometry, as well as a taxonomy 
of the topics covered in this book. Topics are roughly positioned along the left-right axis 
depending on whether they are more closely related to image-based (left), geometry-based 
(middle) or appearance-based (right) representations, and on the vertical axis by increasing 
level of abstraction. The whole figure should be taken with a large grain of salt, as there are 
many additional subtle connections between topics not illustrated here. 

Figure 1.11 shows a rough layout of the contents of this book. Since computer vision 
involves going from images to a structural description of the scene (and computer graphics 
the converse), I have positioned the chapters horizontally in terms of which major component 
they address, in addition to vertically according to their dependence. 

Going from left to right, we see the major column headings as Images (which are 2D 
in nature), Geometry (which encompasses 3D descriptions), and Photometry (which encom- 
passes object appearance). (An alternative labeling for these latter two could also be shape 
and appearance; see, e.g., Chapter 13 and Kang, Szeliski, and Anandan (2000).) Going 
from top to bottom, we see increasing levels of modeling and abstraction, as well as tech- 
niques that build on previously developed algorithms. Of course, this taxonomy should be 
taken with a large grain of salt, as the processing and dependencies in this diagram are not 
strictly sequential and subtle additional dependencies and relationships also exist (e.g., some 
recognition techniques make use of 3D information). The placement of topics along the hor- 
izontal axis should also be taken lightly, as most vision algorithms involve mapping between 
at least two different representations. 9 

Interspersed throughout the book are sample applications, which relate the algorithms 
and mathematical material being presented in various chapters to useful, real-world applica- 
tions. Many of these applications are also presented in the exercises sections, so that students 
can write their own. 

At the end of each section, I provide a set of exercises that the students can use to imple- 
ment, test, and refine the algorithms and techniques presented in each section. Some of the 
exercises are suitable as written homework assignments, others as shorter one-week projects, 
and still others as open-ended research problems that make for challenging final projects. 
Motivated students who implement a reasonable subset of these exercises will, by the end of 
the book, have a computer vision software library that can be used for a variety of interesting 
tasks and projects. 

Because this book also serves as a reference, I try wherever possible to discuss which techniques 
and algorithms work well in practice and to provide up-to-date pointers to the latest research results in 
the areas that I cover. The exercises can be used to build up your own personal library of self- 
tested and validated vision algorithms, which is more worthwhile in the long term (assuming 
you have the time) than simply pulling algorithms out of a library whose performance you do 
not really understand. 

9 For an interesting comparison with what is known about the human visual system, e.g., the largely parallel what 
and where pathways, see some textbooks on human perception (Palmer 1999; Livingstone 2008). 

Figure 1.12 A pictorial summary of the chapter contents. Sources: Brown, Szeliski, and 
Winder (2005); Comaniciu and Meer (2002); Snavely, Seitz, and Szeliski (2006); Nagel 
and Enkelmann (1986); Szeliski and Shum (1997); Debevec and Malik (1997); Gortler, 
Grzeszczuk, Szeliski et al. (1996); Viola and Jones (2004); see the figures in the respec- 
tive chapters for copyright information. 

The book begins in Chapter 2 with a review of the image formation processes that create 
the images that we see and capture. Understanding this process is fundamental if you want 
to take a scientific (model-based) approach to computer vision. Students who are eager to 
just start implementing algorithms (or courses that have limited time) can skip ahead to the 
next chapter and dip into this material later. In Chapter 2, we break down image formation 
into three major components. Geometric image formation (Section 2.1) deals with points, 
lines, and planes, and how these are mapped onto images using projective geometry and other 
models (including radial lens distortion). Photometric image formation (Section 2.2) covers 
radiometry, which describes how light interacts with surfaces in the world, and optics, which 
projects light onto the sensor plane. Finally, Section 2.3 covers how sensors work, including 
topics such as sampling and aliasing, color sensing, and in-camera compression. 

Chapter 3 covers image processing, which is needed in almost all computer vision appli- 
cations. This includes topics such as linear and non-linear filtering (Section 3.3), the Fourier 
transform (Section 3.4), image pyramids and wavelets (Section 3.5), geometric transforma- 
tions such as image warping (Section 3.6), and global optimization techniques such as regu- 
larization and Markov Random Fields (MRFs) (Section 3.7). While most of this material is 
covered in courses and textbooks on image processing, the use of optimization techniques is 
more typically associated with computer vision (although MRFs are now being widely used 
in image processing as well). The section on MRFs is also the first introduction to the use 
of Bayesian inference techniques, which are covered at a more abstract level in Appendix B. 
Chapter 3 also presents applications such as seamless image blending and image restoration. 
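
As a small illustration of the image pyramids mentioned above, the following sketch builds a Gaussian pyramid with NumPy. It assumes a separable binomial kernel [1, 4, 6, 4, 1]/16 and reflective border padding; both are common choices made here for illustration, and the function names `blur_and_downsample` and `gaussian_pyramid` are hypothetical rather than taken from this book.

```python
import numpy as np

def blur_and_downsample(image):
    """One pyramid step: separable binomial blur, then keep every other pixel."""
    kernel = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
    # Reflective padding so that the blurred image keeps the input size.
    padded = np.pad(image, 2, mode="reflect")
    # The kernel is separable: filter along rows, then along columns.
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    both = np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")
    return both[::2, ::2]

def gaussian_pyramid(image, levels):
    """Return a list of images, from full resolution down to the coarsest level."""
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels - 1):
        pyramid.append(blur_and_downsample(pyramid[-1]))
    return pyramid

# Hypothetical 16x16 test image; each level halves the resolution.
pyramid = gaussian_pyramid(np.ones((16, 16)), levels=3)
```

Each level must be at least 3 pixels wide and tall for the reflective padding used here, so the number of levels should be chosen accordingly.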

In Chapter 4, we cover feature detection and matching. A lot of current 3D reconstruction 
and recognition techniques are built on extracting and matching feature points (Section 4.1), 
so this is a fundamental technique required by many subsequent chapters (Chapters 6, 7, 9 
and 14). We also cover edge and straight line detection in Sections 4.2 and 4.3. 

Chapter 5 covers region segmentation techniques, including active contour detection and 
tracking (Section 5.1). Segmentation techniques include top-down (split) and bottom-up 
(merge) techniques, mean shift techniques that find modes of clusters, and various graph- 
based segmentation approaches. All of these techniques are essential building blocks that are 
widely used in a variety of applications, including performance-driven animation, interactive 
image editing, and recognition. 

In Chapter 6, we cover geometric alignment and camera calibration. We introduce the 
basic techniques of feature-based alignment in Section 6.1 and show how this problem can 
be solved using either linear or non-linear least squares, depending on the motion involved. 
We also introduce additional concepts, such as uncertainty weighting and robust regression, 
which are essential to making real-world systems work. Feature-based alignment is then used 
as a building block for 3D pose estimation (extrinsic calibration) in Section 6.2 and camera 
(intrinsic) calibration in Section 6.3. Chapter 6 also describes applications of these techniques 
to photo alignment for flip-book animations, 3D pose estimation from a hand-held camera, 
and single-view reconstruction of building models. 
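
For a flavor of feature-based alignment with linear least squares, the sketch below fits a 2D affine transform to matched points using `numpy.linalg.lstsq`. The affine motion model and the function name `fit_affine` are assumptions made for this illustration; uncertainty weighting and robust regression are deliberately left out.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2D affine transform mapping src points onto dst points.

    src, dst: arrays of shape (n, 2) holding n >= 3 matched points.
    Returns a 2x3 matrix A such that dst is approximately A @ [x, y, 1].
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    # Homogeneous source coordinates, shape (n, 3).
    X = np.hstack([src, np.ones((len(src), 1))])
    # Solve X @ A.T ~= dst in the least-squares sense.
    A_T, *_ = np.linalg.lstsq(X, dst, rcond=None)
    return A_T.T

# Hypothetical ground-truth transform and noise-free matches.
A_true = np.array([[1.1, 0.2, 5.0],
                   [-0.1, 0.9, -3.0]])
src_pts = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [7.0, 3.0]])
dst_pts = src_pts @ A_true[:, :2].T + A_true[:, 2]
A_est = fit_affine(src_pts, dst_pts)
```

Because the affine model is linear in its six parameters, no iteration is needed; motion models that are non-linear in their parameters call for non-linear least squares instead.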

Chapter 7 covers the topic of structure from motion, which involves the simultaneous 
recovery of 3D camera motion and 3D scene structure from a collection of tracked 2D fea- 
tures. This chapter begins with the easier problem of 3D point triangulation (Section 7.1), 
which is the 3D reconstruction of points from matched features when the camera positions 
are known. It then describes two-frame structure from motion (Section 7.2), for which al- 
gebraic techniques exist, as well as robust sampling techniques such as RANSAC that can 
discount erroneous feature matches. The second half of Chapter 7 describes techniques for 
multi-frame structure from motion, including factorization (Section 7.3), bundle adjustment 
(Section 7.4), and constrained motion and structure models (Section 7.5). It also presents 
applications in view morphing, sparse 3D model construction, and match move. 
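
The robust sampling idea behind RANSAC can be sketched on a deliberately simple problem. The example below estimates a pure 2D translation between matched points, some of which are wrong; it is not the two-frame structure from motion estimator itself. The function name `ransac_translation`, the inlier threshold, and the iteration count are all illustrative assumptions.

```python
import numpy as np

def ransac_translation(src, dst, threshold=1.0, iterations=100, seed=0):
    """Robustly estimate t such that dst ~= src + t despite bad matches."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(iterations):
        # A translation has two unknowns, so one match is a minimal sample.
        i = rng.integers(len(src))
        t = dst[i] - src[i]
        # Count the matches that agree with this hypothesis.
        residuals = np.linalg.norm(src + t - dst, axis=1)
        inliers = residuals < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Re-estimate the translation from all inliers of the best hypothesis.
    t = (dst[best_inliers] - src[best_inliers]).mean(axis=0)
    return t, best_inliers

# Hypothetical matches: ten points shifted by (3, -2), three of them corrupted.
src_pts = np.array([[i, (i * i) % 7] for i in range(10)], dtype=float)
dst_pts = src_pts + np.array([3.0, -2.0])
dst_pts[0] += [50.0, 0.0]
dst_pts[1] += [0.0, -40.0]
dst_pts[2] += [30.0, 30.0]
t_est, inlier_mask = ransac_translation(src_pts, dst_pts)
```

The same sample, score, and re-estimate loop applies to richer models; only the minimal sample size and the residual function change.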

In Chapter 8, we go back to a topic that deals directly with image intensities (as op- 
posed to feature tracks), namely dense intensity-based motion estimation (optical flow). We 
start with the simplest possible motion models, translational motion (Section 8.1), and cover 
topics such as hierarchical (coarse-to-fine) motion estimation, Fourier-based techniques, and 
iterative refinement. We then present parametric motion models, which can be used to com- 
pensate for camera rotation and zooming, as well as affine or planar perspective motion (Sec- 
tion 8.2). This is then generalized to spline-based motion models (Section 8.3) and finally 
to general per-pixel optical flow (Section 8.4), including layered and learned motion models 
(Section 8.5). Applications of these techniques include automated morphing, frame interpo- 
lation (slow motion), and motion-based user interfaces. 
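
The simplest translational motion model can be illustrated with an exhaustive search over integer shifts that minimizes the sum of squared differences (SSD). This is a minimal sketch under stated assumptions: the function name `estimate_translation` and the search range `max_shift` are hypothetical, and the hierarchical, Fourier-based, and iterative refinements mentioned above are not shown.

```python
import numpy as np

def estimate_translation(reference, moved, max_shift=3):
    """Find the integer (dy, dx) that best aligns `reference` with `moved`."""
    m = max_shift
    best_shift, best_error = (0, 0), np.inf
    for dy in range(-m, m + 1):
        for dx in range(-m, m + 1):
            shifted = np.roll(reference, (dy, dx), axis=(0, 1))
            # Compare only the interior so wrapped-around pixels are ignored.
            diff = shifted[m:-m, m:-m] - moved[m:-m, m:-m]
            error = np.sum(diff ** 2)
            if error < best_error:
                best_shift, best_error = (dy, dx), error
    return best_shift

# Hypothetical test: a random image shifted down by 2 and left by 1 pixel.
rng = np.random.default_rng(0)
reference = rng.random((32, 32))
moved = np.roll(reference, (2, -1), axis=(0, 1))
shift = estimate_translation(reference, moved)
```

The cost grows with the square of the search range, which is one motivation for coarse-to-fine estimation on an image pyramid.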

Chapter 9 is devoted to image stitching, i.e., the construction of large panoramas and com- 
posites. While stitching is just one example of computational photography (see Chapter 10), 
there is enough depth here to warrant a separate chapter. We start by discussing various pos- 
sible motion models (Section 9.1), including planar motion and pure camera rotation. We 
then discuss global alignment (Section 9.2), which is a special (simplified) case of general 
bundle adjustment, and then present panorama recognition, i.e., techniques for automatically 
discovering which images actually form overlapping panoramas. Finally, we cover the topics 
of image compositing and blending (Section 9.3), which involve both selecting which pixels 
from which images to use and blending them together so as to disguise exposure differences. 

Image stitching is a wonderful application that ties together most of the material covered 
in earlier parts of this book. It also makes for a good mid-term course project that can build 
on previously developed techniques such as image warping and feature detection and match- 
ing. Chapter 9 also presents more specialized variants of stitching such as whiteboard and 
document scanning, video summarization, panography, full 360° spherical panoramas, and 
interactive photomontage for blending repeated action shots together. 

Chapter 10 presents additional examples of computational photography, which is the pro- 
cess of creating new images from one or more input photographs, often based on the careful 
modeling and calibration of the image formation process (Section 10.1). Computational pho- 
tography techniques include merging multiple exposures to create high dynamic range images 
(Section 10.2), increasing image resolution through blur removal and super-resolution (Sec- 
tion 10.3), and image editing and compositing operations (Section 10.4). We also cover the 
topics of texture analysis, synthesis and inpainting (hole filling) in Section 10.5, as well as 
non-photorealistic rendering (Section 10.5.2). 

In Chapter 11, we turn to the issue of stereo correspondence, which can be thought of 
as a special case of motion estimation where the camera positions are already known (Sec- 
tion 11.1). This additional knowledge enables stereo algorithms to search over a much smaller 
space of correspondences and, in many cases, to produce dense depth estimates that can 
be converted into visible surface models (Section 11.3). We also cover multi-view stereo 
algorithms that build a true 3D surface representation instead of just a single depth map 
(Section 11.6). Applications of stereo matching include head and gaze tracking, as well as 
depth-based background replacement (Z-keying). 

Chapter 12 covers additional 3D shape and appearance modeling techniques. These in- 
clude classic shape-from-X techniques such as shape from shading, shape from texture, and 
shape from focus (Section 12.1), as well as shape from smooth occluding contours (Sec- 
tion 11.2.1) and silhouettes (Section 12.5). An alternative to all of these passive computer 
vision techniques is to use active rangefinding (Section 12.2), i.e., to project patterned light 
onto scenes and recover the 3D geometry through triangulation. Processing all of these 3D 
representations often involves interpolating or simplifying the geometry (Section 12.3), or 
using alternative representations such as surface point sets (Section 12.4). 

The collection of techniques for going from one or more images to partial or full 3D 
models is often called image-based modeling or 3D photography. Section 12.6 examines 
three more specialized application areas (architecture, faces, and human bodies), which can 
use model-based reconstruction to fit parameterized models to the sensed data. Section 12.7 
examines the topic of appearance modeling, i.e., techniques for estimating the texture maps, 
albedos, or even sometimes complete bi-directional reflectance distribution functions (BRDFs) 
that describe the appearance of 3D surfaces. 

In Chapter 13, we discuss the large number of image-based rendering techniques that 
have been developed in the last two decades, including simpler techniques such as view in- 
terpolation (Section 13.1), layered depth images (Section 13.2), and sprites and layers (Sec- 
tion 13.2.1), as well as the more general framework of light fields and Lumigraphs (Sec- 
tion 13.3) and higher-order fields such as environment mattes (Section 13.4). Applications of 
these techniques include navigating 3D collections of photographs using photo tourism and 
viewing 3D models as object movies. 

In Chapter 13, we also discuss video-based rendering, which is the temporal extension of 
image-based rendering. The topics we cover include video-based animation (Section 13.5.1), 
periodic video turned into video textures (Section 13.5.2), and 3D video constructed from 
multiple video streams (Section 13.5.4). Applications of these techniques include video de- 
noising, morphing, and tours based on 360° video. 

Week    Material                                        Project 
(1.)    Chapter 2   Image formation 
2.      Chapter 3   Image processing 
3.      Chapter 4   Feature detection and matching      P1 
4.      Chapter 6   Feature-based alignment 
5.      Chapter 9   Image stitching                     P2 
6.      Chapter 8   Dense motion estimation 
7.      Chapter 7   Structure from motion               PP 
8.      Chapter 14  Recognition 
(9.)    Chapter 10  Computational photography 
10.     Chapter 11  Stereo correspondence 
(11.)   Chapter 12  3D reconstruction 
12.     Chapter 13  Image-based rendering 
13.     Final project presentations                     FP 

Table 1.1 Sample syllabi for 10-week and 13-week courses. The weeks in parentheses are 
not used in the shorter version. P1 and P2 are two early-term mini-projects, PP is when the 
(student-selected) final project proposals are due, and FP is the final project presentations. 


Chapter 14 describes different approaches to recognition. It begins with techniques for 
detecting and recognizing faces (Sections 14.1 and 14.2), then looks at techniques for finding 
and recognizing particular objects (instance recognition) in Section 14.3. Next, we cover the 
most difficult variant of recognition, namely the recognition of broad categories, such as cars, 
motorcycles, horses and other animals (Section 14.4), and the role that scene context plays in 
recognition (Section 14.5). 

To support the book’s use as a textbook, the appendices and associated Web site contain 
more detailed mathematical topics and additional material. Appendix A covers linear algebra 
and numerical techniques, including matrix algebra, least squares, and iterative techniques. 
Appendix B covers Bayesian estimation theory, including maximum likelihood estimation, 
robust statistics, Markov random fields, and uncertainty modeling. Appendix C describes the 
supplementary material available to complement this book, including images and data sets, 
pointers to software, course slides, and an on-line bibliography. 


1.4 Sample syllabus 

Teaching all of the material covered in this book in a single quarter or semester course is a 
Herculean task and likely one not worth attempting. It is better to simply pick and choose 
topics related to the lecturer’s preferred emphasis and tailored to the set of mini-projects 
envisioned for the students. 

Steve Seitz and I have successfully used a 10-week syllabus similar to the one shown in 
Table 1.1 (omitting the parenthesized weeks) as both an undergraduate and a graduate-level 
course in computer vision. The undergraduate course 10 tends to go lighter on the mathematics 
and takes more time reviewing basics, while the graduate-level course 11 dives more deeply
into techniques and assumes the students already have a decent grounding in either vision 
or related mathematical techniques. (See also the Introduction to Computer Vision course at 
Stanford, 12 which uses a similar curriculum.) Related courses have also been taught on the 
topics of 3D photography 13 and computational photography. 14 

When Steve and I teach the course, we prefer to give the students several small program- 
ming projects early in the course rather than focusing on written homework or quizzes. With 
a suitable choice of topics, it is possible for these projects to build on each other. For exam- 
ple, introducing feature matching early on can be used in a second assignment to do image 
alignment and stitching. Alternatively, direct (optical flow) techniques can be used to do the 
alignment and more focus can be put on either graph cut seam selection or multi-resolution 
blending techniques. 

We also ask the students to propose a final project (we provide a set of suggested topics 
for those who need ideas) by the middle of the course and reserve the last week of the class 
for student presentations. With any luck, some of these final projects can actually turn into 
conference submissions! 

No matter how you decide to structure the course or how you choose to use this book, I 
encourage you to try at least a few small programming tasks to get a good feel for how vision 
techniques work, and when they do not. Better yet, pick topics that are fun and can be used on 
your own photographs, and try to push your creative boundaries to come up with surprising 
results. 


1.5 A note on notation 

For better or worse, the notation found in computer vision and multi-view geometry textbooks
tends to vary all over the map (Faugeras 1993; Hartley and Zisserman 2004; Girod, Greiner,
and Niemann 2000; Faugeras and Luong 2001; Forsyth and Ponce 2003). In this book, I
use the convention I first learned in my high school physics class (and later multi-variate
calculus and computer graphics courses), which is that vectors $\mathbf{v}$ are lower case bold, matrices
$\mathbf{M}$ are upper case bold, and scalars $(T, s)$ are mixed case italic. Unless otherwise noted,
vectors operate as column vectors, i.e., they post-multiply matrices, $\mathbf{M}\mathbf{v}$, although they are
sometimes written as comma-separated parenthesized lists $\mathbf{x} = (x, y)$ instead of bracketed
column vectors $\mathbf{x} = [x \; y]^T$. Some commonly used matrices are $\mathbf{R}$ for rotations, $\mathbf{K}$ for
calibration matrices, and $\mathbf{I}$ for the identity matrix. Homogeneous coordinates (Section 2.1)
are denoted with a tilde over the vector, e.g., $\tilde{\mathbf{x}} = (\tilde{x}, \tilde{y}, \tilde{w}) = \tilde{w}(x, y, 1) = \tilde{w}\bar{\mathbf{x}}$ in $\mathcal{P}^2$. The
cross product operator in matrix form is denoted by $[\,]_\times$.

10 http://www.cs.washington.edu/education/courses/455/

11 http://www.cs.washington.edu/education/courses/576/

12 http://vision.stanford.edu/teaching/cs223b/

13 http://www.cs.washington.edu/education/courses/558/06sp/

14 http://graphics.cs.cmu.edu/courses/15-463/

1.6 Additional reading 

This book attempts to be self-contained, so that students can implement the basic assignments 
and algorithms described here without the need for outside references. However, it does pre- 
suppose a general familiarity with basic concepts in linear algebra and numerical techniques, 
which are reviewed in Appendix A, and image processing, which is reviewed in Chapter 3. 

Students who want to delve more deeply into these topics can look in (Golub and Van 
Loan 1996) for matrix algebra and (Strang 1988) for linear algebra. In image processing, 
there are a number of popular textbooks, including (Crane 1997; Gomes and Velho 1997; 
Jähne 1997; Pratt 2007; Russ 2007; Burger and Burge 2008; Gonzales and Woods 2008). For
computer graphics, popular texts include (Foley, van Dam, Feiner et al. 1995; Watt 1995), 
with (Glassner 1995) providing a more in-depth look at image formation and rendering. For 
statistics and machine learning, Chris Bishop’s (2006) book is a wonderful and comprehen- 
sive introduction with a wealth of exercises. Students may also want to look in other textbooks 
on computer vision for material that we do not cover here, as well as for additional project 
ideas (Ballard and Brown 1982; Faugeras 1993; Nalwa 1993; Trucco and Verri 1998; Forsyth 
and Ponce 2003). 

There is, however, no substitute for reading the latest research literature, both for the lat- 
est ideas and techniques and for the most up-to-date references to related literature. 15 In this 
book, I have attempted to cite the most recent work in each field so that students can read them 
directly and use them as inspiration for their own work. Browsing the last few years’ con- 
ference proceedings from the major vision and graphics conferences, such as CVPR, ECCV, 
ICCV, and SIGGRAPH, will provide a wealth of new ideas. The tutorials offered at these 
conferences, for which slides or notes are often available on-line, are also an invaluable re- 
source. 


15 For a comprehensive bibliography and taxonomy of computer vision research, Keith Price’s Annotated Com- 
puter Vision Bibliography http://www.visionbib.com/bibliography/contents.html is an invaluable resource. 


Chapter 2 


Image formation 


2.1 Geometric primitives and transformations
    2.1.1 Geometric primitives
    2.1.2 2D transformations
    2.1.3 3D transformations
    2.1.4 3D rotations
    2.1.5 3D to 2D projections
    2.1.6 Lens distortions
2.2 Photometric image formation
    2.2.1 Lighting
    2.2.2 Reflectance and shading
    2.2.3 Optics
2.3 The digital camera
    2.3.1 Sampling and aliasing
    2.3.2 Color
    2.3.3 Compression
2.4 Additional reading
2.5 Exercises




Computer Vision: Algorithms and Applications (September 3, 2010 draft) 





(d) Bayer color filter array pattern, by rows: G R G R / B G B G / G R G R / B G B G


Figure 2.1 A few components of the image formation process: (a) perspective projection;
(b) light scattering when hitting a surface; (c) lens optics; (d) Bayer color filter array.

Before we can intelligently analyze and manipulate images, we need to establish a vocabulary
for describing the geometry of a scene. We also need to understand the image formation 
process that produced a particular image given a set of lighting conditions, scene geometry, 
surface properties, and camera optics. In this chapter, we present a simplified model of such 
an image formation process. 

Section 2.1 introduces the basic geometric primitives used throughout the book (points, 
lines, and planes) and the geometric transformations that project these 3D quantities into 2D 
image features (Figure 2.1a). Section 2.2 describes how lighting, surface properties (Fig- 
ure 2.1b), and camera optics (Figure 2.1c) interact in order to produce the color values that 
fall onto the image sensor. Section 2.3 describes how continuous color images are turned into 
discrete digital samples inside the image sensor (Figure 2.1d) and how to avoid (or at least
characterize) sampling deficiencies, such as aliasing. 

The material covered in this chapter is but a brief summary of a very rich and deep set of 
topics, traditionally covered in a number of separate fields. A more thorough introduction to 
the geometry of points, lines, planes, and projections can be found in textbooks on multi-view 
geometry (Hartley and Zisserman 2004; Faugeras and Luong 2001) and computer graphics 
(Foley, van Dam, Feiner et al. 1995). The image formation (synthesis) process is traditionally
taught as part of a computer graphics curriculum (Foley, van Dam, Feiner et al. 1995; Glass- 
ner 1995; Watt 1995; Shirley 2005) but it is also studied in physics-based computer vision 
(Wolff, Shafer, and Healey 1992a). The behavior of camera lens systems is studied in optics 
(Moller 1988; Hecht 2001; Ray 2002). Two good books on color theory are (Wyszecki and 
Stiles 2000; Healey and Shafer 1992), with (Livingstone 2008) providing a more fun and in- 
formal introduction to the topic of color perception. Topics relating to sampling and aliasing 
are covered in textbooks on signal and image processing (Crane 1997; Jähne 1997; Oppenheim
and Schafer 1996; Oppenheim, Schafer, and Buck 1999; Pratt 2007; Russ 2007; Burger
and Burge 2008; Gonzales and Woods 2008). 

A note to students: If you have already studied computer graphics, you may want to 
skim the material in Section 2.1, although the sections on projective depth and object-centered 
projection near the end of Section 2.1.5 may be new to you. Similarly, physics students (as 
well as computer graphics students) will mostly be familiar with Section 2.2. Finally, students 
with a good background in image processing will already be familiar with sampling issues 
(Section 2.3) as well as some of the material in Chapter 3. 


2.1 Geometric primitives and transformations 

In this section, we introduce the basic 2D and 3D primitives used in this textbook, namely
points, lines, and planes. We also describe how 3D features are projected into 2D features.
More detailed descriptions of these topics (along with a gentler and more intuitive introduction)
can be found in textbooks on multiple-view geometry (Hartley and Zisserman 2004;
Faugeras and Luong 2001). 

2.1.1 Geometric primitives 

Geometric primitives form the basic building blocks used to describe three-dimensional shapes. 
In this section, we introduce points, lines, and planes. Later sections of the book discuss 
curves (Sections 5.1 and 11.2), surfaces (Section 12.3), and volumes (Section 12.5). 

2D points. 2D points (pixel coordinates in an image) can be denoted using a pair of values,
$\mathbf{x} = (x, y) \in \mathcal{R}^2$, or alternatively,

    $\mathbf{x} = \begin{bmatrix} x \\ y \end{bmatrix}.$    (2.1)

(As stated in the introduction, we use the $(x_1, x_2, \ldots)$ notation to denote column vectors.)

2D points can also be represented using homogeneous coordinates, $\tilde{\mathbf{x}} = (\tilde{x}, \tilde{y}, \tilde{w}) \in \mathcal{P}^2$,
where vectors that differ only by scale are considered to be equivalent. $\mathcal{P}^2 = \mathcal{R}^3 - (0, 0, 0)$
is called the 2D projective space.

A homogeneous vector $\tilde{\mathbf{x}}$ can be converted back into an inhomogeneous vector $\mathbf{x}$ by
dividing through by the last element $\tilde{w}$, i.e.,

    $\tilde{\mathbf{x}} = (\tilde{x}, \tilde{y}, \tilde{w}) = \tilde{w}(x, y, 1) = \tilde{w}\bar{\mathbf{x}},$    (2.2)

where $\bar{\mathbf{x}} = (x, y, 1)$ is the augmented vector. Homogeneous points whose last element is $\tilde{w} = 0$
are called ideal points or points at infinity and do not have an equivalent inhomogeneous
representation.
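
The conversion between inhomogeneous and homogeneous 2D points can be sketched in a few lines of Python. This is only an illustration under the assumption that points are stored as plain tuples; the function names `to_homogeneous` and `to_inhomogeneous` and the tolerance `eps` are hypothetical and do not come from the text.

```python
def to_homogeneous(x, y, w=1.0):
    """Return the homogeneous point (w*x, w*y, w).

    With w = 1 this is the augmented vector (x, y, 1).
    """
    return (w * x, w * y, w)


def to_inhomogeneous(p, eps=1e-12):
    """Divide through by the last element of a homogeneous 2D point.

    Returns None for ideal points (last element ~ 0), which have no
    inhomogeneous equivalent. The threshold eps is an arbitrary choice.
    """
    xt, yt, wt = p
    if abs(wt) < eps:
        return None
    return (xt / wt, yt / wt)
```

For instance, `to_inhomogeneous(to_homogeneous(3.0, 4.0, 2.0))` recovers the original point, because vectors that differ only by scale are equivalent.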

2D lines. 2D lines can also be represented using homogeneous coordinates $\tilde{\mathbf{l}} = (a, b, c)$.
The corresponding line equation is

    $\bar{\mathbf{x}} \cdot \tilde{\mathbf{l}} = ax + by + c = 0.$    (2.3)

We can normalize the line equation vector so that $\mathbf{l} = (\hat{n}_x, \hat{n}_y, d) = (\hat{\mathbf{n}}, d)$ with $\|\hat{\mathbf{n}}\| = 1$. In
this case, $\hat{\mathbf{n}}$ is the normal vector perpendicular to the line and $d$ is its distance to the origin
(Figure 2.2). (The one exception to this normalization is the line at infinity $\tilde{\mathbf{l}} = (0, 0, 1)$,
which includes all (ideal) points at infinity.)

We can also express $\hat{\mathbf{n}}$ as a function of rotation angle $\theta$, $\hat{\mathbf{n}} = (\hat{n}_x, \hat{n}_y) = (\cos\theta, \sin\theta)$
(Figure 2.2a). This representation is commonly used in the Hough transform line-finding
algorithm, which is discussed in Section 4.3.2. The combination $(\theta, d)$ is also known as
polar coordinates.

Figure 2.2 (a) 2D line equation and (b) 3D plane equation, expressed in terms of the normal
$\hat{\mathbf{n}}$ and distance to the origin $d$.

When using homogeneous coordinates, we can compute the intersection of two lines as

    $\tilde{\mathbf{x}} = \tilde{\mathbf{l}}_1 \times \tilde{\mathbf{l}}_2,$    (2.4)

where $\times$ is the cross product operator. Similarly, the line joining two points can be written as

    $\tilde{\mathbf{l}} = \tilde{\mathbf{x}}_1 \times \tilde{\mathbf{x}}_2.$    (2.5)

When trying to fit an intersection point to multiple lines or, conversely, a line to multiple
points, least squares techniques (Section 6.1.1 and Appendix A.2) can be used, as discussed
in Exercise 2.1.
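
The two cross-product constructions above can be tried out with a small Python sketch. It assumes lines are stored as homogeneous triples (a, b, c) and points as (x, y) pairs; the helper names `cross`, `line_through`, and `intersect` are illustrative only.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def line_through(p, q):
    """Homogeneous line (a, b, c) joining two inhomogeneous 2D points."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def intersect(l1, l2, eps=1e-12):
    """Intersection of two homogeneous lines as an inhomogeneous point.

    Returns None when the lines meet at an ideal point (parallel lines).
    """
    x, y, w = cross(l1, l2)
    if abs(w) < eps:
        return None
    return (x / w, y / w)
```

As a usage example, the line through (0, 0) and (1, 1) and the line through (0, 2) and (2, 0) intersect at (1, 1), while the parallel lines x = 0 and x = 1 yield `None`, since their intersection is a point at infinity.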

2D conics. There are other algebraic curves that can be expressed with simple polynomial 
homogeneous equations. For example, the conic sections (so called because they arise as the 
intersection of a plane and a 3D cone) can be written using a quadric equation 

    $\tilde{\mathbf{x}}^T \mathbf{Q} \tilde{\mathbf{x}} = 0.$    (2.6)

Quadric equations play useful roles in the study of multi-view geometry and camera calibra- 
tion (Hartley and Zisserman 2004; Faugeras and Luong 2001) but are not used extensively in 
this book. 

3D points. Point coordinates in three dimensions can be written using inhomogeneous coordinates
$\mathbf{x} = (x, y, z) \in \mathcal{R}^3$ or homogeneous coordinates $\tilde{\mathbf{x}} = (\tilde{x}, \tilde{y}, \tilde{z}, \tilde{w}) \in \mathcal{P}^3$. As before,
it is sometimes useful to denote a 3D point using the augmented vector $\bar{\mathbf{x}} = (x, y, z, 1)$ with
$\tilde{\mathbf{x}} = \tilde{w}\bar{\mathbf{x}}$.

Figure 2.3 3D line equation, $\mathbf{r} = (1 - \lambda)\mathbf{p} + \lambda\mathbf{q}$.

3D planes. 3D planes can also be represented as homogeneous coordinates $\tilde{\mathbf{m}} = (a, b, c, d)$
with a corresponding plane equation

    $\bar{\mathbf{x}} \cdot \tilde{\mathbf{m}} = ax + by + cz + d = 0.$    (2.7)

We can also normalize the plane equation as $\mathbf{m} = (\hat{n}_x, \hat{n}_y, \hat{n}_z, d) = (\hat{\mathbf{n}}, d)$ with $\|\hat{\mathbf{n}}\| = 1$.
In this case, $\hat{\mathbf{n}}$ is the normal vector perpendicular to the plane and $d$ is its distance to the
origin (Figure 2.2b). As with the case of 2D lines, the plane at infinity $\tilde{\mathbf{m}} = (0, 0, 0, 1)$,
which contains all the points at infinity, cannot be normalized (i.e., it does not have a unique
normal or a finite distance).

We can express $\hat{\mathbf{n}}$ as a function of two angles $(\theta, \phi)$,

    $\hat{\mathbf{n}} = (\cos\theta\cos\phi, \sin\theta\cos\phi, \sin\phi),$    (2.8)

i.e., using spherical coordinates, but these are less commonly used than polar coordinates 
since they do not uniformly sample the space of possible normal vectors. 

3D lines. Lines in 3D are less elegant than either lines in 2D or planes in 3D. One possible 
representation is to use two points on the line, (p, q). Any other point on the line can be 
expressed as a linear combination of these two points 

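    $\mathbf{r} = (1 - \lambda)\mathbf{p} + \lambda\mathbf{q},$    (2.9)

as shown in Figure 2.3. If we restrict $0 \le \lambda \le 1$, we get the line segment joining $\mathbf{p}$ and $\mathbf{q}$.

If we use homogeneous coordinates, we can write the line as

    $\tilde{\mathbf{r}} = \mu\tilde{\mathbf{p}} + \lambda\tilde{\mathbf{q}}.$    (2.10)

A special case of this is when the second point is at infinity, i.e., $\tilde{\mathbf{q}} = (\hat{d}_x, \hat{d}_y, \hat{d}_z, 0) = (\hat{\mathbf{d}}, 0)$.
Here, we see that $\hat{\mathbf{d}}$ is the direction of the line. We can then re-write the inhomogeneous 3D
line equation as

    $\mathbf{r} = \mathbf{p} + \lambda\hat{\mathbf{d}}.$    (2.11)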




A disadvantage of the endpoint representation for 3D lines is that it has too many degrees 
of freedom, i.e., six (three for each endpoint) instead of the four degrees that a 3D line truly 
has. However, if we fix the two points on the line to lie in specific planes, we obtain a rep- 
resentation with four degrees of freedom. For example, if we are representing nearly vertical 
lines, then $z = 0$ and $z = 1$ form two suitable planes, i.e., the $(x, y)$ coordinates in both
planes provide the four coordinates describing the line. This kind of two-plane parameteri- 
zation is used in the light field and Lumigraph image-based rendering systems described in 
Chapter 13 to represent the collection of rays seen by a camera as it moves in front of an 
object. The two-endpoint representation is also useful for representing line segments, even 
when their exact endpoints cannot be seen (only guessed at). 

If we wish to represent all possible lines without bias towards any particular orientation, 
we can use Plücker coordinates (Hartley and Zisserman 2004, Chapter 2; Faugeras and Luong
2001, Chapter 3). These coordinates are the six independent non-zero entries in the 4 × 4 skew
symmetric matrix

    $\mathbf{L} = \tilde{\mathbf{p}}\tilde{\mathbf{q}}^T - \tilde{\mathbf{q}}\tilde{\mathbf{p}}^T,$    (2.12)

where $\tilde{\mathbf{p}}$ and $\tilde{\mathbf{q}}$ are any two (non-identical) points on the line. This representation has only
four degrees of freedom, since $\mathbf{L}$ is homogeneous and also satisfies $\det(\mathbf{L}) = 0$, which results
in a quadratic constraint on the Plücker coordinates.
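
To make the quadratic constraint concrete, here is a hedged Python sketch that builds the matrix in (2.12) from two homogeneous 4-vectors. It relies on the standard fact that the determinant of a 4 × 4 skew symmetric matrix is the square of its Pfaffian, so det(L) = 0 is equivalent to the Pfaffian vanishing. The function names are illustrative and not from the text.

```python
def plucker_matrix(p, q):
    """4x4 skew symmetric matrix L = p q^T - q p^T for homogeneous points p, q."""
    return [[p[i] * q[j] - q[i] * p[j] for j in range(4)] for i in range(4)]


def pfaffian4(L):
    """Pfaffian of a 4x4 skew symmetric matrix; det(L) equals its square."""
    return L[0][1] * L[2][3] - L[0][2] * L[1][3] + L[0][3] * L[1][2]
```

Choosing a different pair of points on the same line only rescales L, which is why the representation is homogeneous. For example, the point pairs ((0, 0, 0, 1), (1, 2, 3, 1)) and ((1, 2, 3, 1), (2, 4, 6, 1)) lie on the same line through the origin and produce the same matrix.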

In practice, the minimal representation is not essential for most applications. An ade- 
quate model of 3D lines can be obtained by estimating their direction (which may be known 
ahead of time, e.g., for architecture) and some point within the visible portion of the line 
(see Section 7.5.1) or by using the two endpoints, since lines are most often visible as finite 
line segments. However, if you are interested in more details about the topic of minimal 
line parameterizations, Förstner (2005) discusses various ways to infer and model 3D lines in
projective geometry, as well as how to estimate the uncertainty in such fitted models. 

3D quadrics. The 3D analog of a conic section is a quadric surface 

    $\bar{\mathbf{x}}^T \mathbf{Q} \bar{\mathbf{x}} = 0$    (2.13)

(Hartley and Zisserman 2004, Chapter 2). Again, while quadric surfaces are useful in the 
study of multi-view geometry and can also serve as useful modeling primitives (spheres, 
ellipsoids, cylinders), we do not study them in great detail in this book. 

2.1.2 2D transformations 

Having defined our basic primitives, we can now turn our attention to how they can be transformed.
The simplest transformations occur in the 2D plane and are illustrated in Figure 2.4.

Figure 2.4 Basic set of 2D planar transformations.


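Translation. 2D translations can be written as $\mathbf{x}' = \mathbf{x} + \mathbf{t}$ or

    $\mathbf{x}' = \begin{bmatrix} \mathbf{I} & \mathbf{t} \end{bmatrix} \bar{\mathbf{x}}$    (2.14)

where $\mathbf{I}$ is the (2 × 2) identity matrix or

    $\bar{\mathbf{x}}' = \begin{bmatrix} \mathbf{I} & \mathbf{t} \\ \mathbf{0}^T & 1 \end{bmatrix} \bar{\mathbf{x}}$    (2.15)

where $\mathbf{0}$ is the zero vector. Using a 2 × 3 matrix results in a more compact notation, whereas
using a full-rank 3 × 3 matrix (which can be obtained from the 2 × 3 matrix by appending a
$[\mathbf{0}^T \; 1]$ row) makes it possible to chain transformations using matrix multiplication. Note that
in any equation where an augmented vector such as $\bar{\mathbf{x}}$ appears on both sides, it can always be
replaced with a full homogeneous vector $\tilde{\mathbf{x}}$.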


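Rotation + translation. This transformation is also known as 2D rigid body motion or the
2D Euclidean transformation (since Euclidean distances are preserved). It can be written as
$\mathbf{x}' = \mathbf{R}\mathbf{x} + \mathbf{t}$ or

    $\mathbf{x}' = \begin{bmatrix} \mathbf{R} & \mathbf{t} \end{bmatrix} \bar{\mathbf{x}}$    (2.16)

where

    $\mathbf{R} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$    (2.17)

is an orthonormal rotation matrix with $\mathbf{R}\mathbf{R}^T = \mathbf{I}$ and $|\mathbf{R}| = 1$.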


Scaled rotation. Also known as the similarity transform, this transformation can be expressed
as $\mathbf{x}' = s\mathbf{R}\mathbf{x} + \mathbf{t}$ where $s$ is an arbitrary scale factor. It can also be written as

    $\mathbf{x}' = \begin{bmatrix} s\mathbf{R} & \mathbf{t} \end{bmatrix} \bar{\mathbf{x}} = \begin{bmatrix} a & -b & t_x \\ b & a & t_y \end{bmatrix} \bar{\mathbf{x}},$    (2.18)

where we no longer require that $a^2 + b^2 = 1$. The similarity transform preserves angles
between lines.

Affine. The affine transformation is written as $\mathbf{x}' = \mathbf{A}\bar{\mathbf{x}}$, where $\mathbf{A}$ is an arbitrary 2 × 3
matrix, i.e.,

    $\mathbf{x}' = \begin{bmatrix} a_{00} & a_{01} & a_{02} \\ a_{10} & a_{11} & a_{12} \end{bmatrix} \bar{\mathbf{x}}.$    (2.19)

Parallel lines remain parallel under affine transformations.


Projective. This transformation, also known as a perspective transform or homography,
operates on homogeneous coordinates,

    $\tilde{\mathbf{x}}' = \tilde{\mathbf{H}}\tilde{\mathbf{x}},$    (2.20)

where $\tilde{\mathbf{H}}$ is an arbitrary 3 × 3 matrix. Note that $\tilde{\mathbf{H}}$ is homogeneous, i.e., it is only defined
up to a scale, and that two $\tilde{\mathbf{H}}$ matrices that differ only by scale are equivalent. The resulting
homogeneous coordinate $\tilde{\mathbf{x}}'$ must be normalized in order to obtain an inhomogeneous result
$\mathbf{x}'$, i.e.,

    $x' = \dfrac{h_{00}x + h_{01}y + h_{02}}{h_{20}x + h_{21}y + h_{22}}$ and $y' = \dfrac{h_{10}x + h_{11}y + h_{12}}{h_{20}x + h_{21}y + h_{22}}.$    (2.21)

Perspective transformations preserve straight lines (i.e., they remain straight after the
transformation).
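
The normalization step in (2.21) is easy to get wrong in code, so a minimal Python sketch may help. It assumes the homography is stored as a 3 × 3 nested list; the function name `apply_homography` and the tolerance are hypothetical choices for illustration.

```python
def apply_homography(H, point, eps=1e-12):
    """Map an inhomogeneous 2D point through a 3x3 homography H.

    Computes the homogeneous result H * (x, y, 1) and divides through by
    its last element, as in (2.21). Raises ValueError when the point maps
    to an ideal point (last element ~ 0).
    """
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    wh = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(wh) < eps:
        raise ValueError("point maps to infinity")
    return (xh / wh, yh / wh)
```

Because H is only defined up to scale, multiplying every entry of H by the same non-zero constant leaves the output of `apply_homography` unchanged.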


Hierarchy of 2D transformations. The preceding set of transformations is illustrated
in Figure 2.4 and summarized in Table 2.1. The easiest way to think of them is as a set
of (potentially restricted) 3 × 3 matrices operating on 2D homogeneous coordinate vectors.
Hartley and Zisserman (2004) contains a more detailed description of the hierarchy of 2D
planar transformations.

The above transformations form a nested set of groups, i.e., they are closed under com- 
position and have an inverse that is a member of the same group. (This will be important 
later when applying these transformations to images in Section 3.6.) Each (simpler) group is 
a subset of the more complex group below it. 
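
The group properties mentioned above (closure under composition and existence of an inverse within the same group) can be checked numerically for the rigid (Euclidean) case. The following Python sketch assumes 3 × 3 homogeneous matrices stored as nested lists; `rigid`, `matmul`, and `rigid_inverse` are illustrative helper names.

```python
import math


def rigid(theta, tx, ty):
    """3x3 homogeneous matrix [R t; 0^T 1] for a 2D rigid (Euclidean) motion."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, tx],
            [s, c, ty],
            [0.0, 0.0, 1.0]]


def matmul(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rigid_inverse(T):
    """Inverse of [R t; 0^T 1], namely [R^T  -R^T t; 0^T 1]."""
    (r00, r01, tx), (r10, r11, ty), _ = T
    return [[r00, r10, -(r00 * tx + r10 * ty)],
            [r01, r11, -(r01 * tx + r11 * ty)],
            [0.0, 0.0, 1.0]]
```

Composing `rigid(0.3, 1.0, 2.0)` with `rigid(0.5, -4.0, 0.0)` gives another rigid transform whose rotation angle is 0.8, and multiplying a rigid transform by its `rigid_inverse` gives the identity up to rounding.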

Transformation       Matrix                                                # DoF   Preserves        Icon
translation          $[\mathbf{I} \mid \mathbf{t}]_{2 \times 3}$           2       orientation      (not shown)
rigid (Euclidean)    $[\mathbf{R} \mid \mathbf{t}]_{2 \times 3}$           3       lengths          (not shown)
similarity           $[s\mathbf{R} \mid \mathbf{t}]_{2 \times 3}$          4       angles           (not shown)
affine               $[\mathbf{A}]_{2 \times 3}$                           6       parallelism      (not shown)
projective           $[\tilde{\mathbf{H}}]_{3 \times 3}$                   8       straight lines   (not shown)

Table 2.1 Hierarchy of 2D coordinate transformations. Each transformation also preserves
the properties listed in the rows below it, i.e., similarity preserves not only angles but also
parallelism and straight lines. The 2 × 3 matrices are extended with a third $[\mathbf{0}^T \; 1]$ row to form
a full 3 × 3 matrix for homogeneous coordinate transformations.

Co-vectors. While the above transformations can be used to transform points in a 2D plane,
can they also be used directly to transform a line equation? Consider the homogeneous equation
$\tilde{\mathbf{l}} \cdot \tilde{\mathbf{x}} = 0$. If we transform $\tilde{\mathbf{x}}' = \tilde{\mathbf{H}}\tilde{\mathbf{x}}$, we obtain

    $\tilde{\mathbf{l}}' \cdot \tilde{\mathbf{x}}' = \tilde{\mathbf{l}}'^T \tilde{\mathbf{H}} \tilde{\mathbf{x}} = (\tilde{\mathbf{H}}^T \tilde{\mathbf{l}}')^T \tilde{\mathbf{x}} = \tilde{\mathbf{l}} \cdot \tilde{\mathbf{x}} = 0,$    (2.22)

i.e., $\tilde{\mathbf{l}}' = \tilde{\mathbf{H}}^{-T}\tilde{\mathbf{l}}$. Thus, the action of a projective transformation on a co-vector such as a 2D
line or 3D normal can be represented by the transposed inverse of the matrix, which is equivalent
to the adjoint of $\tilde{\mathbf{H}}$, since projective transformation matrices are homogeneous. Jim
Blinn (1998) describes (in Chapters 9 and 10) the ins and outs of notating and manipulating
co-vectors.
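
As a hedged illustration of the co-vector rule, the Python sketch below transforms a homogeneous 2D line by the cofactor matrix of H. The cofactor matrix equals det(H) times the transposed inverse, so for homogeneous quantities it can stand in for the transposed inverse without any division. The function names are hypothetical.

```python
def cofactor_matrix(H):
    """Cofactor matrix of a 3x3 matrix, equal to det(H) * inverse(H)^T."""
    (a, b, c), (d, e, f), (g, h, i) = H
    return [[e * i - f * h, -(d * i - f * g), d * h - e * g],
            [-(b * i - c * h), a * i - c * g, -(a * h - b * g)],
            [b * f - c * e, -(a * f - c * d), a * e - b * d]]


def transform_line(H, line):
    """Transform a homogeneous 2D line (a, b, c) under the point map x' = H x."""
    C = cofactor_matrix(H)
    return tuple(sum(C[r][k] * line[k] for k in range(3)) for r in range(3))


def transform_point(H, p):
    """Transform a homogeneous 2D point under x' = H x (no normalization)."""
    return tuple(sum(H[r][k] * p[k] for k in range(3)) for r in range(3))
```

A point that lies on a line before the transformation still lies on the transformed line afterwards: for a translation by (2, 3), the line x = 1, stored as (1, 0, -1), becomes (1, 0, -3), i.e., x = 3.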

While the above transformations are the ones we use most extensively, a number of addi- 
tional transformations are sometimes used. 

Stretch/squash. This transformation changes the aspect ratio of an image,

    $x' = s_x x + t_x$
    $y' = s_y y + t_y,$

and is a restricted form of an affine transformation. Unfortunately, it does not nest cleanly
with the groups listed in Table 2.1.

Planar surface flow. This eight-parameter transformation (Horn 1986; Bergen, Anandan,
Hanna et al. 1992; Girod, Greiner, and Niemann 2000),

    $x' = a_0 + a_1 x + a_2 y + a_6 x^2 + a_7 xy$
    $y' = a_3 + a_4 x + a_5 y + a_7 x^2 + a_6 xy,$

arises when a planar surface undergoes a small 3D motion. It can thus be thought of as a
small motion approximation to a full homography. Its main attraction is that it is linear in the
motion parameters, $a_k$, which are often the quantities being estimated.

Transformation       Matrix                                                # DoF   Preserves        Icon
translation          $[\mathbf{I} \mid \mathbf{t}]_{3 \times 4}$           3       orientation      (not shown)
rigid (Euclidean)    $[\mathbf{R} \mid \mathbf{t}]_{3 \times 4}$           6       lengths          (not shown)
similarity           $[s\mathbf{R} \mid \mathbf{t}]_{3 \times 4}$          7       angles           (not shown)
affine               $[\mathbf{A}]_{3 \times 4}$                           12      parallelism      (not shown)
projective           $[\tilde{\mathbf{H}}]_{4 \times 4}$                   15      straight lines   (not shown)

Table 2.2 Hierarchy of 3D coordinate transformations. Each transformation also preserves
the properties listed in the rows below it, i.e., similarity preserves not only angles but also
parallelism and straight lines. The 3 × 4 matrices are extended with a fourth $[\mathbf{0}^T \; 1]$ row to
form a full 4 × 4 matrix for homogeneous coordinate transformations. The mnemonic icons
are drawn in 2D but are meant to suggest transformations occurring in a full 3D cube.


Bilinear interpolant. This eight-parameter transform (Wolberg 1990),

    $x' = a_0 + a_1 x + a_2 y + a_6 xy$
    $y' = a_3 + a_4 x + a_5 y + a_7 xy,$

can be used to interpolate the deformation due to the motion of the four corner points of
a square. (In fact, it can interpolate the motion of any four non-collinear points.) While
the deformation is linear in the motion parameters, it does not generally preserve straight
lines (only lines parallel to the square axes). However, it is often quite useful, e.g., in the
interpolation of sparse grids using splines (Section 8.3).

2.1.3 3D transformations 

The set of three-dimensional coordinate transformations is very similar to that available for 
2D transformations and is summarized in Table 2.2. As in 2D, these transformations form a 
nested set of groups. Hartley and Zisserman (2004, Section 2.4) give a more detailed descrip- 
tion of this hierarchy. 

Translation. 3D translations can be written as $\mathbf{x}' = \mathbf{x} + \mathbf{t}$ or

    $\mathbf{x}' = \begin{bmatrix} \mathbf{I} & \mathbf{t} \end{bmatrix} \bar{\mathbf{x}}$    (2.23)

where $\mathbf{I}$ is the (3 × 3) identity matrix and $\mathbf{0}$ is the zero vector.


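Rotation + translation. Also known as 3D rigid body motion or the 3D Euclidean transformation,
it can be written as $\mathbf{x}' = \mathbf{R}\mathbf{x} + \mathbf{t}$ or

    $\mathbf{x}' = \begin{bmatrix} \mathbf{R} & \mathbf{t} \end{bmatrix} \bar{\mathbf{x}}$    (2.24)

where $\mathbf{R}$ is a 3 × 3 orthonormal rotation matrix with $\mathbf{R}\mathbf{R}^T = \mathbf{I}$ and $|\mathbf{R}| = 1$. Note that
sometimes it is more convenient to describe a rigid motion using

    $\mathbf{x}' = \mathbf{R}(\mathbf{x} - \mathbf{c}) = \mathbf{R}\mathbf{x} - \mathbf{R}\mathbf{c},$    (2.25)

where $\mathbf{c}$ is the center of rotation (often the camera center).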

Compactly parameterizing a 3D rotation is a non-trivial task, which we describe in more 
detail below. 


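Scaled rotation. The 3D similarity transform can be expressed as $\mathbf{x}' = s\mathbf{R}\mathbf{x} + \mathbf{t}$ where $s$
is an arbitrary scale factor. It can also be written as

    $\mathbf{x}' = \begin{bmatrix} s\mathbf{R} & \mathbf{t} \end{bmatrix} \bar{\mathbf{x}}.$    (2.26)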


This transformation preserves angles between lines and planes. 


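Affine. The affine transform is written as $\mathbf{x}' = \mathbf{A}\bar{\mathbf{x}}$, where $\mathbf{A}$ is an arbitrary 3 × 4 matrix,
i.e.,

    $\mathbf{x}' = \begin{bmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ a_{20} & a_{21} & a_{22} & a_{23} \end{bmatrix} \bar{\mathbf{x}}.$    (2.27)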

Parallel lines and planes remain parallel under affine transformations. 


Projective. This transformation, variously known as a 3D perspective transform, homography,
or collineation, operates on homogeneous coordinates,

    $\tilde{\mathbf{x}}' = \tilde{\mathbf{H}}\tilde{\mathbf{x}},$    (2.28)

where $\tilde{\mathbf{H}}$ is an arbitrary 4 × 4 homogeneous matrix. As in 2D, the resulting homogeneous
coordinate $\tilde{\mathbf{x}}'$ must be normalized in order to obtain an inhomogeneous result $\mathbf{x}$. Perspective
transformations preserve straight lines (i.e., they remain straight after the transformation).

Figure 2.5 Rotation around an axis $\hat{\mathbf{n}}$ by an angle $\theta$.


2.1.4 3D rotations 

The biggest difference between 2D and 3D coordinate transformations is that the parameterization
of the 3D rotation matrix $\mathbf{R}$ is not as straightforward, but several possibilities exist.

Euler angles 

A rotation matrix can be formed as the product of three rotations around three cardinal axes, 
e.g., $x$, $y$, and $z$, or $x$, $y$, and $x$. This is generally a bad idea, as the result depends on the
order in which the transforms are applied. What is worse, it is not always possible to move 
smoothly in the parameter space, i.e., sometimes one or more of the Euler angles change 
dramatically in response to a small change in rotation. 1 For these reasons, we do not even 
give the formula for Euler angles in this book — interested readers can look in other textbooks 
or technical reports (Faugeras 1993; Diebel 2006). Note that, in some applications, if the 
rotations are known to be a set of uni-axial transforms, they can always be represented using 
an explicit set of rigid transformations. 

Axis/angle (exponential twist) 

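A rotation can be represented by a rotation axis $\hat{\mathbf{n}}$ and an angle $\theta$, or equivalently by a 3D
vector $\boldsymbol{\omega} = \theta\hat{\mathbf{n}}$. Figure 2.5 shows how we can compute the equivalent rotation. First, we
project the vector $\mathbf{v}$ onto the axis $\hat{\mathbf{n}}$ to obtain

    $\mathbf{v}_\parallel = \hat{\mathbf{n}}(\hat{\mathbf{n}} \cdot \mathbf{v}) = (\hat{\mathbf{n}}\hat{\mathbf{n}}^T)\mathbf{v},$    (2.29)

which is the component of $\mathbf{v}$ that is not affected by the rotation. Next, we compute the
perpendicular residual of $\mathbf{v}$ from $\hat{\mathbf{n}}$,

    $\mathbf{v}_\perp = \mathbf{v} - \mathbf{v}_\parallel = (\mathbf{I} - \hat{\mathbf{n}}\hat{\mathbf{n}}^T)\mathbf{v}.$    (2.30)

1 In robotics, this is sometimes referred to as gimbal lock.

We can rotate this vector by 90° using the cross product,

    $\mathbf{v}_\times = \hat{\mathbf{n}} \times \mathbf{v} = [\hat{\mathbf{n}}]_\times \mathbf{v},$    (2.31)

where $[\hat{\mathbf{n}}]_\times$ is the matrix form of the cross product operator with the vector $\hat{\mathbf{n}} = (\hat{n}_x, \hat{n}_y, \hat{n}_z)$,

    $[\hat{\mathbf{n}}]_\times = \begin{bmatrix} 0 & -\hat{n}_z & \hat{n}_y \\ \hat{n}_z & 0 & -\hat{n}_x \\ -\hat{n}_y & \hat{n}_x & 0 \end{bmatrix}.$    (2.32)

Note that rotating this vector by another 90° is equivalent to taking the cross product again,

    $\mathbf{v}_{\times\times} = \hat{\mathbf{n}} \times \mathbf{v}_\times = [\hat{\mathbf{n}}]_\times^2 \mathbf{v} = -\mathbf{v}_\perp,$

and hence

    $\mathbf{v}_\parallel = \mathbf{v} - \mathbf{v}_\perp = \mathbf{v} + \mathbf{v}_{\times\times} = (\mathbf{I} + [\hat{\mathbf{n}}]_\times^2)\mathbf{v}.$

We can now compute the in-plane component of the rotated vector $\mathbf{u}$ as

    $\mathbf{u}_\perp = \cos\theta\,\mathbf{v}_\perp + \sin\theta\,\mathbf{v}_\times = (\sin\theta\,[\hat{\mathbf{n}}]_\times - \cos\theta\,[\hat{\mathbf{n}}]_\times^2)\mathbf{v}.$

Putting all these terms together, we obtain the final rotated vector as

    $\mathbf{u} = \mathbf{u}_\perp + \mathbf{v}_\parallel = (\mathbf{I} + \sin\theta\,[\hat{\mathbf{n}}]_\times + (1 - \cos\theta)[\hat{\mathbf{n}}]_\times^2)\mathbf{v}.$    (2.33)

We can therefore write the rotation matrix corresponding to a rotation by $\theta$ around an axis $\hat{\mathbf{n}}$
as

    $\mathbf{R}(\hat{\mathbf{n}}, \theta) = \mathbf{I} + \sin\theta\,[\hat{\mathbf{n}}]_\times + (1 - \cos\theta)[\hat{\mathbf{n}}]_\times^2,$    (2.34)

which is known as Rodrigues' formula (Ayache 1989).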
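
Formula (2.34) translates almost directly into code. The following Python sketch uses nested lists instead of a matrix library to stay self-contained; the names `skew`, `matmul3`, and `rotation_from_axis_angle` are illustrative, and normalizing the axis inside the function is a convenience assumed here rather than something the derivation requires.

```python
import math


def skew(n):
    """Matrix form [n]_x of the cross product operator, as in (2.32)."""
    nx, ny, nz = n
    return [[0.0, -nz, ny],
            [nz, 0.0, -nx],
            [-ny, nx, 0.0]]


def matmul3(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_from_axis_angle(axis, theta):
    """R = I + sin(theta) [n]_x + (1 - cos(theta)) [n]_x^2, theta in radians."""
    norm = math.sqrt(sum(c * c for c in axis))
    n = [c / norm for c in axis]  # make sure the axis has unit length
    K = skew(n)
    K2 = matmul3(K, K)
    s, c1 = math.sin(theta), 1.0 - math.cos(theta)
    return [[(1.0 if i == j else 0.0) + s * K[i][j] + c1 * K2[i][j]
             for j in range(3)] for i in range(3)]
```

For example, a rotation by 90° about the z axis maps the x axis onto the y axis, so the first column of the resulting matrix is (0, 1, 0) up to rounding.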

The product of the axis $\hat{\mathbf{n}}$ and angle $\theta$, $\boldsymbol{\omega} = \theta\hat{\mathbf{n}} = (\omega_x, \omega_y, \omega_z)$, is a minimal representation
for a 3D rotation. Rotations through common angles such as multiples of 90° can be
represented exactly (and converted to exact matrices) if $\theta$ is stored in degrees. Unfortunately,
this representation is not unique, since we can always add a multiple of 360° ($2\pi$ radians) to
$\theta$ and get the same rotation matrix. As well, $(\hat{\mathbf{n}}, \theta)$ and $(-\hat{\mathbf{n}}, -\theta)$ represent the same rotation.

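However, for small rotations (e.g., corrections to rotations), this is an excellent choice.
In particular, for small (infinitesimal or instantaneous) rotations and $\theta$ expressed in radians,
Rodrigues' formula simplifies to

    $\mathbf{R}(\boldsymbol{\omega}) \approx \mathbf{I} + \sin\theta\,[\hat{\mathbf{n}}]_\times \approx \mathbf{I} + [\theta\hat{\mathbf{n}}]_\times = \begin{bmatrix} 1 & -\omega_z & \omega_y \\ \omega_z & 1 & -\omega_x \\ -\omega_y & \omega_x & 1 \end{bmatrix},$    (2.35)

which gives a nice linearized relationship between the rotation parameters $\boldsymbol{\omega}$ and $\mathbf{R}$. We can
also write $\mathbf{R}(\boldsymbol{\omega})\mathbf{v} \approx \mathbf{v} + \boldsymbol{\omega} \times \mathbf{v}$, which is handy when we want to compute the derivative of
$\mathbf{R}\mathbf{v}$ with respect to $\boldsymbol{\omega}$,

    $\dfrac{\partial \mathbf{R}\mathbf{v}}{\partial \boldsymbol{\omega}^T} = -[\mathbf{v}]_\times = \begin{bmatrix} 0 & z & -y \\ -z & 0 & x \\ y & -x & 0 \end{bmatrix}.$    (2.36)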


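Another way to derive a rotation through a finite angle is called the exponential twist
(Murray, Li, and Sastry 1994). A rotation by an angle $\theta$ is equivalent to $k$ rotations through
$\theta/k$. In the limit as $k \to \infty$, we obtain

    $\mathbf{R}(\hat{\mathbf{n}}, \theta) = \lim_{k \to \infty} \left(\mathbf{I} + \frac{1}{k}[\theta\hat{\mathbf{n}}]_\times\right)^k = \exp\,[\boldsymbol{\omega}]_\times.$    (2.37)

If we expand the matrix exponential as a Taylor series (using the identity $[\hat{\mathbf{n}}]_\times^{k+2} = -[\hat{\mathbf{n}}]_\times^k$,
$k > 0$, and again assuming $\theta$ is in radians),

    $\begin{aligned} \exp\,[\boldsymbol{\omega}]_\times &= \mathbf{I} + \theta[\hat{\mathbf{n}}]_\times + \frac{\theta^2}{2}[\hat{\mathbf{n}}]_\times^2 + \frac{\theta^3}{3!}[\hat{\mathbf{n}}]_\times^3 + \cdots \\ &= \mathbf{I} + \left(\theta - \frac{\theta^3}{3!} + \cdots\right)[\hat{\mathbf{n}}]_\times + \left(\frac{\theta^2}{2} - \frac{\theta^4}{4!} + \cdots\right)[\hat{\mathbf{n}}]_\times^2 \\ &= \mathbf{I} + \sin\theta\,[\hat{\mathbf{n}}]_\times + (1 - \cos\theta)[\hat{\mathbf{n}}]_\times^2, \end{aligned}$    (2.38)

which yields the familiar Rodrigues' formula.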


Unit quaternions 

The unit quaternion representation is closely related to the angle/axis representation. A unit
quaternion is a unit length 4-vector whose components can be written as $\mathbf{q} = (q_x, q_y, q_z, q_w)$
or $\mathbf{q} = (x, y, z, w)$ for short. Unit quaternions live on the unit sphere $\|\mathbf{q}\| = 1$ and antipodal
(opposite sign) quaternions, $\mathbf{q}$ and $-\mathbf{q}$, represent the same rotation (Figure 2.6). Other than
this ambiguity (dual covering), the unit quaternion representation of a rotation is unique.
Furthermore, the representation is continuous, i.e., as rotation matrices vary continuously,
one can find a continuous quaternion representation, although the path on the quaternion
sphere may wrap all the way around before returning to the “origin” $\mathbf{q}_0 = (0, 0, 0, 1)$. For
these and other reasons given below, quaternions are a very popular representation for pose
and for pose interpolation in computer graphics (Shoemake 1985).

Quaternions can be derived from the axis/angle representation through the formula

    $\mathbf{q} = (\mathbf{v}, w) = \left(\sin\frac{\theta}{2}\,\hat{\mathbf{n}}, \cos\frac{\theta}{2}\right),$    (2.39)

where $\hat{\mathbf{n}}$ and $\theta$ are the rotation axis and angle. Using the trigonometric identities $\sin\theta =
2\sin\frac{\theta}{2}\cos\frac{\theta}{2}$ and $(1 - \cos\theta) = 2\sin^2\frac{\theta}{2}$, Rodrigues' formula can be converted to

    $\begin{aligned} \mathbf{R}(\hat{\mathbf{n}}, \theta) &= \mathbf{I} + \sin\theta\,[\hat{\mathbf{n}}]_\times + (1 - \cos\theta)[\hat{\mathbf{n}}]_\times^2 \\ &= \mathbf{I} + 2w[\mathbf{v}]_\times + 2[\mathbf{v}]_\times^2. \end{aligned}$    (2.40)

Figure 2.6 Unit quaternions live on the unit sphere $\|\mathbf{q}\| = 1$. This figure shows a smooth
trajectory through the three quaternions $\mathbf{q}_0$, $\mathbf{q}_1$, and $\mathbf{q}_2$. The antipodal point to $\mathbf{q}_2$, namely
$-\mathbf{q}_2$, represents the same rotation as $\mathbf{q}_2$.

This suggests a quick way to rotate a vector $v$ by a quaternion using a series of cross products, scalings, and additions. To obtain a formula for $R(q)$ as a function of $(x, y, z, w)$, recall that

$$
[v]_\times = \begin{bmatrix} 0 & -z & y \\ z & 0 & -x \\ -y & x & 0 \end{bmatrix}
\quad \text{and} \quad
[v]_\times^2 = \begin{bmatrix} -y^2 - z^2 & xy & xz \\ xy & -x^2 - z^2 & yz \\ xz & yz & -x^2 - y^2 \end{bmatrix}.
$$


We thus obtain

$$
R(q) = \begin{bmatrix}
1 - 2(y^2 + z^2) & 2(xy - zw) & 2(xz + yw) \\
2(xy + zw) & 1 - 2(x^2 + z^2) & 2(yz - xw) \\
2(xz - yw) & 2(yz + xw) & 1 - 2(x^2 + y^2)
\end{bmatrix}. \qquad (2.41)
$$


The diagonal terms can be made more symmetrical by replacing $1 - 2(y^2 + z^2)$ with $(x^2 + w^2 - y^2 - z^2)$, etc.
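As a rough illustration of (2.40) and (2.41), the following sketch builds $R(q)$ from a unit quaternion and also rotates a point directly with two cross products. It is a minimal example in plain Python; the helper names (`axis_angle_to_quat`, `quat_to_matrix`, `rotate`) are chosen here for illustration and are not from the text.

```python
import math


def axis_angle_to_quat(axis, theta):
    """Unit quaternion (x, y, z, w) = (sin(theta/2) n, cos(theta/2)) for a rotation axis n."""
    norm = math.sqrt(sum(a * a for a in axis))
    s = math.sin(theta / 2) / norm  # dividing by norm normalizes the axis
    return (axis[0] * s, axis[1] * s, axis[2] * s, math.cos(theta / 2))


def quat_to_matrix(q):
    """Rotation matrix R(q) of (2.41) for a unit quaternion q = (x, y, z, w)."""
    x, y, z, w = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


def cross(a, b):
    """3D cross product a x b."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def rotate(q, p):
    """Apply (2.40) to a point: p + 2w (v x p) + 2 v x (v x p)."""
    v, w = q[:3], q[3]
    c1 = cross(v, p)
    c2 = cross(v, c1)
    return tuple(p[i] + 2 * w * c1[i] + 2 * c2[i] for i in range(3))


q = axis_angle_to_quat((0.0, 0.0, 1.0), math.pi / 2)  # 90 degrees about the z axis
R = quat_to_matrix(q)
p_rot = rotate(q, (1.0, 0.0, 0.0))  # the x axis should land (approximately) on the y axis
```

The cross-product form avoids building the matrix when only a few points need to be rotated, while the matrix form is preferable when many points share the same rotation.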

The nicest aspect of unit quaternions is that there is a simple algebra for composing rotations expressed as unit quaternions. Given two quaternions $q_0 = (v_0, w_0)$ and $q_1 = (v_1, w_1)$, the quaternion multiply operator is defined as

$$
q_2 = q_0 q_1 = (v_0 \times v_1 + w_0 v_1 + w_1 v_0, \; w_0 w_1 - v_0 \cdot v_1), \qquad (2.42)
$$

with the property that $R(q_2) = R(q_0) R(q_1)$. Note that quaternion multiplication is not commutative, just as 3D rotations and matrix multiplications are not.

Taking the inverse of a quaternion is easy: Just flip the sign of $v$ or $w$ (but not both!). (You can verify this has the desired effect of transposing the $R$ matrix in (2.41).) Thus, we can also define quaternion division as

$$
q_2 = q_0 / q_1 = q_0 q_1^{-1} = (v_0 \times v_1 + w_0 v_1 - w_1 v_0, \; -w_0 w_1 - v_0 \cdot v_1). \qquad (2.43)
$$


This is useful when the incremental rotation between two rotations is desired. 

In particular, if we want to determine a rotation that is partway between two given rotations, we can compute the incremental rotation, take a fraction of the angle, and compute the new rotation. This procedure is called spherical linear interpolation or slerp for short (Shoemake 1985) and is given in Algorithm 2.1. Note that Shoemake presents two formulas other than the one given here. The first exponentiates $q_r$ by $\alpha$ before multiplying the original quaternion,

$$
q_2 = q_r^\alpha q_0, \qquad (2.44)
$$

while the second treats the quaternions as 4-vectors on a sphere and uses

$$
q_2 = \frac{\sin (1 - \alpha)\theta}{\sin\theta} q_0 + \frac{\sin \alpha\theta}{\sin\theta} q_1, \qquad (2.45)
$$

where $\theta = \cos^{-1}(q_0 \cdot q_1)$ and the dot product is directly between the quaternion 4-vectors. All of these formulas give comparable results, although care should be taken when $q_0$ and $q_1$ are close together, which is why I prefer to use an arctangent to establish the rotation angle.


Which rotation representation is better? 

The choice of representation for 3D rotations depends partly on the application. 

The axis/angle representation is minimal, and hence does not require any additional constraints on the parameters (no need to re-normalize after each update). If the angle is expressed in degrees, it is easier to understand the pose (say, 90° twist around the x-axis), and also easier to express exact rotations. When the angle is in radians, the derivatives of $R$ with respect to $\omega$ can easily be computed (2.36).

Quaternions, on the other hand, are better if you want to keep track of a smoothly moving 
camera, since there are no discontinuities in the representation. It is also easier to interpolate 
between rotations and to chain rigid transformations (Murray, Li, and Sastry 1994; Bregler 
and Malik 1998). 

My usual preference is to use quaternions, but to update their estimates using an incremental rotation, as described in Section 6.2.2.

procedure slerp($q_0$, $q_1$, $\alpha$):

1. $q_r = q_1 / q_0 = (v_r, w_r)$
2. if $w_r < 0$ then $q_r \leftarrow -q_r$
3. $\theta_r = 2 \tan^{-1}(\|v_r\| / w_r)$
4. $\hat{n}_r = \mathcal{N}(v_r) = v_r / \|v_r\|$
5. $\theta_\alpha = \alpha \, \theta_r$
6. $q_\alpha = (\sin\frac{\theta_\alpha}{2} \hat{n}_r, \cos\frac{\theta_\alpha}{2})$
7. return $q_2 = q_\alpha q_0$

Algorithm 2.1 Spherical linear interpolation (slerp). The axis and total angle are first computed from the quaternion ratio. (This computation can be lifted outside an inner loop that generates a set of interpolated positions for animation.) An incremental quaternion is then computed and multiplied by the starting rotation quaternion.
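The steps of Algorithm 2.1 can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names are chosen here, the inverse is taken by negating $v$ (the text's (2.43) negates $w$ instead, which yields the antipodal quaternion and hence the same rotation), and the early return for nearly identical inputs is an added assumption to avoid dividing by a vanishing $\|v_r\|$.

```python
import math


def qmul(q0, q1):
    """Quaternion product (2.42) for q = (x, y, z, w): (v0 x v1 + w0 v1 + w1 v0, w0 w1 - v0 . v1)."""
    x0, y0, z0, w0 = q0
    x1, y1, z1, w1 = q1
    return (
        y0 * z1 - z0 * y1 + w0 * x1 + w1 * x0,
        z0 * x1 - x0 * z1 + w0 * y1 + w1 * y0,
        x0 * y1 - y0 * x1 + w0 * z1 + w1 * z0,
        w0 * w1 - (x0 * x1 + y0 * y1 + z0 * z1),
    )


def qinv(q):
    """Inverse of a unit quaternion, obtained by flipping the sign of v."""
    x, y, z, w = q
    return (-x, -y, -z, w)


def slerp(q0, q1, alpha):
    """Interpolate from q0 (alpha = 0) to q1 (alpha = 1), following Algorithm 2.1."""
    qr = qmul(q1, qinv(q0))                      # step 1: q_r = q1 / q0
    if qr[3] < 0:                                # step 2: pick the shorter arc
        qr = tuple(-c for c in qr)
    vnorm = math.sqrt(qr[0] ** 2 + qr[1] ** 2 + qr[2] ** 2)
    if vnorm < 1e-12:                            # assumption: rotations coincide, no axis
        return q0
    theta_r = 2 * math.atan2(vnorm, qr[3])       # step 3: arctangent for the angle
    n = (qr[0] / vnorm, qr[1] / vnorm, qr[2] / vnorm)  # step 4: unit axis
    theta_a = alpha * theta_r                    # step 5: fraction of the angle
    s = math.sin(theta_a / 2)
    qa = (s * n[0], s * n[1], s * n[2], math.cos(theta_a / 2))  # step 6
    return qmul(qa, q0)                          # step 7: q_2 = q_alpha q_0


identity = (0.0, 0.0, 0.0, 1.0)
quarter_z = (0.0, 0.0, math.sin(math.pi / 4), math.cos(math.pi / 4))  # 90 degrees about z
halfway = slerp(identity, quarter_z, 0.5)  # should be a 45 degree rotation about z
```

When many values of `alpha` are needed for the same pair of rotations, steps 1 to 4 can be computed once and reused, as the caption of Algorithm 2.1 notes.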

2.1.5 3D to 2D projections 

Now that we know how to represent 2D and 3D geometric primitives and how to transform 
them spatially, we need to specify how 3D primitives are projected onto the image plane. We 
can do this using a linear 3D to 2D projection matrix. The simplest model is orthography, 
which requires no division to get the final (inhomogeneous) result. The more commonly used 
model is perspective, since this more accurately models the behavior of real cameras. 

Orthography and para-perspective 

An orthographic projection simply drops the $z$ component of the three-dimensional coordinate $p$ to obtain the 2D point $x$. (In this section, we use $p$ to denote 3D points and $x$ to denote 2D points.) This can be written as

$$
x = [I_{2 \times 2} \mid 0] \, p. \qquad (2.46)
$$

If we are using homogeneous (projective) coordinates, we can write

$$
\tilde{x} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \tilde{p}, \qquad (2.47)
$$

i.e., we drop the $z$ component but keep the $w$ component. Orthography is an approximate model for long focal length (telephoto) lenses and objects whose depth is shallow relative to their distance to the camera (Sawhney and Hanson 1991). It is exact only for telecentric lenses (Baker and Nayar 1999, 2001).

Figure 2.7 Commonly used projection models: (a) 3D view of world, (b) orthography, (c) scaled orthography, (d) para-perspective, (e) perspective, (f) object-centered. Each diagram shows a top-down view of the projection. Note how parallel lines on the ground plane and box sides remain parallel in the non-perspective projections.

In practice, world coordinates (which may measure dimensions in meters) need to be scaled to fit onto an image sensor (physically measured in millimeters, but ultimately measured in pixels). For this reason, scaled orthography is actually more commonly used,

$$
x = [s I_{2 \times 2} \mid 0] \, p. \qquad (2.48)
$$

This model is equivalent to first projecting the world points onto a local fronto-parallel image plane and then scaling this image using regular perspective projection. The scaling can be the same for all parts of the scene (Figure 2.7b) or it can be different for objects that are being modeled independently (Figure 2.7c). More importantly, the scaling can vary from frame to frame when estimating structure from motion, which can better model the scale change that occurs as an object approaches the camera.

Scaled orthography is a popular model for reconstructing the 3D shape of objects far away from the camera, since it greatly simplifies certain computations. For example, pose (camera orientation) can be estimated using simple least squares (Section 6.2.1). Under orthography, structure and motion can simultaneously be estimated using factorization (singular value decomposition), as discussed in Section 7.3 (Tomasi and Kanade 1992).

A closely related projection model is para-perspective (Aloimonos 1990; Poelman and Kanade 1997). In this model, object points are again first projected onto a local reference parallel to the image plane. However, rather than being projected orthogonally to this plane, they are projected parallel to the line of sight to the object center (Figure 2.7d). This is followed by the usual projection onto the final image plane, which again amounts to a scaling. The combination of these two projections is therefore affine and can be written as

$$
\tilde{x} = \begin{bmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ 0 & 0 & 0 & 1 \end{bmatrix} \tilde{p}. \qquad (2.49)
$$

Note how parallel lines in 3D remain parallel after projection in Figure 2.7b-d. Para-perspective provides a more accurate projection model than scaled orthography, without incurring the added complexity of per-pixel perspective division, which invalidates traditional factorization methods (Poelman and Kanade 1997).


Perspective 

The most commonly used projection in computer graphics and computer vision is true 3D perspective (Figure 2.7e). Here, points are projected onto the image plane by dividing them by their $z$ component. Using inhomogeneous coordinates, this can be written as

$$
\bar{x} = \mathcal{P}_z(p) = \begin{bmatrix} x/z \\ y/z \\ 1 \end{bmatrix}. \qquad (2.50)
$$

In homogeneous coordinates, the projection has a simple linear form,

$$
\tilde{x} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \tilde{p}, \qquad (2.51)
$$


i.e., we drop the w component of p. Thus, after projection, it is not possible to recover the 
distance of the 3D point from the image, which makes sense for a 2D imaging sensor. 
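To make the difference between these models concrete, the sketch below projects two points of a shallow, distant object with orthography (2.46), scaled orthography (2.48), and perspective (2.50). The scene values and the choice of scale $s = 1/(\text{mean depth})$ are illustrative assumptions, not values from the text.

```python
def orthographic(p):
    """Drop the z component, as in (2.46)."""
    x, y, _z = p
    return (x, y)


def scaled_orthographic(p, s):
    """Drop the z component and scale by s, as in (2.48)."""
    x, y, _z = p
    return (s * x, s * y)


def perspective(p):
    """Divide by the z component, as in (2.50)."""
    x, y, z = p
    return (x / z, y / z)


# Hypothetical scene: two points on a shallow object roughly 100 units from the camera.
near_pt = (10.0, 5.0, 99.0)
far_pt = (10.0, 5.0, 101.0)
s = 1.0 / 100.0  # assumed scale: one over the mean object depth

for p in (near_pt, far_pt):
    print(perspective(p), scaled_orthographic(p, s))
```

Because the depth variation (±1) is small relative to the distance (100), the two projections agree to about 1% here; the discrepancy grows as the object gets deeper or closer, which is when the perspective model becomes necessary.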

A form often seen in computer graphics systems is a two-step projection that first projects 3D coordinates into normalized device coordinates in the range $(x, y, z) \in [-1, 1] \times [-1, 1] \times [0, 1]$, and then rescales these coordinates to integer pixel coordinates using a viewport transformation (Watt 1995; OpenGL-ARB 1997). The (initial) perspective projection is then represented using a 4 × 4 matrix

$$
\tilde{x} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & -z_{far}/z_{range} & z_{near} z_{far}/z_{range} \\ 0 & 0 & 1 & 0 \end{bmatrix} \tilde{p}, \qquad (2.52)
$$

where $z_{near}$ and $z_{far}$ are the near and far $z$ clipping planes and $z_{range} = z_{far} - z_{near}$. Note that the first two rows are actually scaled by the focal length and the aspect ratio so that visible rays are mapped to $(x, y, z) \in [-1, 1]^2$. The reason for keeping the third row, rather than dropping it, is that visibility operations, such as z-buffering, require a depth for every graphical element that is being rendered.

If we set $z_{near} = 1$, $z_{far} \to \infty$, and switch the sign of the third row, the third element of the normalized screen vector becomes the inverse depth, i.e., the disparity (Okutomi and Kanade 1993). This can be quite convenient in many cases since, for cameras moving around outdoors, the inverse depth to the camera is often a more well-conditioned parameterization than direct 3D distance.

While a regular 2D image sensor has no way of measuring distance to a surface point, range sensors (Section 12.2) and stereo matching algorithms (Chapter 11) can compute such values. It is then convenient to be able to map from a sensor-based depth or disparity value $d$ directly back to a 3D location using the inverse of a 4 × 4 matrix (Section 2.1.5). We can do this if we represent perspective projection using a full-rank 4 × 4 matrix, as in (2.64).

Figure 2.8 Projection of a 3D camera-centered point $p_c$ onto the sensor plane at location $p$. $O_c$ is the camera center (nodal point), $c_s$ is the 3D origin of the sensor plane coordinate system, and $s_x$ and $s_y$ are the pixel spacings.


Camera intrinsics 


Once we have projected a 3D point through an ideal pinhole using a projection matrix, we must still transform the resulting coordinates according to the pixel sensor spacing and the relative position of the sensor plane to the origin. Figure 2.8 shows an illustration of the geometry involved. In this section, we first present a mapping from 2D pixel coordinates to 3D rays using a sensor homography $M_s$, since this is easier to explain in terms of physically measurable quantities. We then relate these quantities to the more commonly used camera intrinsic matrix $K$, which is used to map 3D camera-centered points $p_c$ to 2D pixel coordinates $\tilde{x}_s$.

Image sensors return pixel values indexed by integer pixel coordinates $(x_s, y_s)$, often with the coordinates starting at the upper-left corner of the image and moving down and to the right. (This convention is not obeyed by all imaging libraries, but the adjustment for other coordinate systems is straightforward.) To map pixel centers to 3D coordinates, we first scale the $(x_s, y_s)$ values by the pixel spacings $(s_x, s_y)$ (sometimes expressed in microns for solid-state sensors) and then describe the orientation of the sensor array relative to the camera projection center $O_c$ with an origin $c_s$ and a 3D rotation $R_s$ (Figure 2.8).

The combined 2D to 3D projection can then be written as

$$
p = [R_s \mid c_s] \begin{bmatrix} s_x & 0 & 0 \\ 0 & s_y & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_s \\ y_s \\ 1 \end{bmatrix} = M_s \bar{x}_s. \qquad (2.53)
$$

The first two columns of the 3 × 3 matrix $M_s$ are the 3D vectors corresponding to unit steps in the image pixel array along the $x_s$ and $y_s$ directions, while the third column is the 3D image array origin $c_s$.

The matrix $M_s$ is parameterized by eight unknowns: the three parameters describing the rotation $R_s$, the three parameters describing the translation $c_s$, and the two scale factors $(s_x, s_y)$. Note that we ignore here the possibility of skew between the two axes on the image plane, since solid-state manufacturing techniques render this negligible. In practice, unless we have accurate external knowledge of the sensor spacing or sensor orientation, there are only seven degrees of freedom, since the distance of the sensor from the origin cannot be teased apart from the sensor spacing, based on external image measurement alone.

However, estimating a camera model $M_s$ with the required seven degrees of freedom (i.e., where the first two columns are orthogonal after an appropriate re-scaling) is impractical, so most practitioners assume a general 3 × 3 homogeneous matrix form.

The relationship between the 3D pixel center $p$ and the 3D camera-centered point $p_c$ is given by an unknown scaling $s$, $p = s p_c$. We can therefore write the complete projection between $p_c$ and a homogeneous version of the pixel address $\tilde{x}_s$ as

$$
\tilde{x}_s = \alpha M_s^{-1} p_c = K p_c. \qquad (2.54)
$$


The 3 x 3 matrix K is called the calibration matrix and describes the camera intrinsics (as 
opposed to the camera’s orientation in space, which are called the extrinsics). 

From the above discussion, we see that K has seven degrees of freedom in theory and 
eight degrees of freedom (the full dimensionality of a 3 x 3 homogeneous matrix) in practice. 
Why, then, do most textbooks on 3D computer vision and multi-view geometry (Faugeras 
1993; Hartley and Zisserman 2004; Faugeras and Luong 2001) treat K as an upper-triangular 
matrix with five degrees of freedom? 

While this is usually not made explicit in these books, it is because we cannot recover the full $K$ matrix based on external measurement alone. When calibrating a camera (Chapter 6) based on external 3D points or other measurements (Tsai 1987), we end up estimating the intrinsic ($K$) and extrinsic ($R$, $t$) camera parameters simultaneously using a series of measurements,

$$
\tilde{x}_s = K [R \mid t] \, p_w = P p_w, \qquad (2.55)
$$

where $p_w$ are known 3D world coordinates and

$$
P = K [R \mid t] \qquad (2.56)
$$

is known as the camera matrix. Inspecting this equation, we see that we can post-multiply $K$ by $R_1$ and pre-multiply $[R \mid t]$ by $R_1^T$, and still end up with a valid calibration. Thus, it is impossible based on image measurements alone to know the true orientation of the sensor and the true camera intrinsics.

The choice of an upper-triangular form for $K$ seems to be conventional. Given a full 3 × 4 camera matrix $P = K [R \mid t]$, we can compute an upper-triangular $K$ matrix using QR factorization (Golub and Van Loan 1996). (Note the unfortunate clash of terminologies: In matrix algebra textbooks, $R$ represents an upper-triangular (right of the diagonal) matrix; in computer vision, $R$ is an orthogonal rotation.)

Figure 2.9 Simplified camera intrinsics showing the focal length $f$ and the optical center $(c_x, c_y)$. The image width and height are $W$ and $H$.

There are several ways to write the upper-triangular form of $K$. One possibility is

$$
K = \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}, \qquad (2.57)
$$

which uses independent focal lengths $f_x$ and $f_y$ for the sensor $x$ and $y$ dimensions. The entry $s$ encodes any possible skew between the sensor axes due to the sensor not being mounted perpendicular to the optical axis and $(c_x, c_y)$ denotes the optical center expressed in pixel coordinates. Another possibility is

$$
K = \begin{bmatrix} f & s & c_x \\ 0 & a f & c_y \\ 0 & 0 & 1 \end{bmatrix}, \qquad (2.58)
$$

where the aspect ratio $a$ has been made explicit and a common focal length $f$ is used.

In practice, for many applications an even simpler form can be obtained by setting $a = 1$ and $s = 0$,

$$
K = \begin{bmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{bmatrix}. \qquad (2.59)
$$

Often, setting the origin at roughly the center of the image, e.g., $(c_x, c_y) = (W/2, H/2)$, where $W$ and $H$ are the image width and height, can result in a perfectly usable camera model with a single unknown, i.e., the focal length $f$.

Figure 2.10 Central projection, showing the relationship between the 3D and 2D coordinates, $p$ and $x$, as well as the relationship between the focal length $f$, image width $W$, and the field of view $\theta$.

Figure 2.9 shows how these quantities can be visualized as part of a simplified imaging 
model. Note that now we have placed the image plane in front of the nodal point (projection 
center of the lens). The sense of the y axis has also been flipped to get a coordinate system 
compatible with the way that most imaging libraries treat the vertical (row) coordinate. Cer- 
tain graphics libraries, such as Direct3D, use a left-handed coordinate system, which can lead 
to some confusion. 


A note on focal lengths 


The issue of how to express focal lengths is one that often causes confusion in implementing 
computer vision algorithms and discussing their results. This is because the focal length 
depends on the units used to measure pixels. 

If we number pixel coordinates using integer values, say $[0, W) \times [0, H)$, the focal length $f$ and camera center $(c_x, c_y)$ in (2.59) can be expressed as pixel values. How do these quantities relate to the more familiar focal lengths used by photographers?

Figure 2.10 illustrates the relationship between the focal length $f$, the sensor width $W$, and the field of view $\theta$, which obey the formula

$$
\tan\frac{\theta}{2} = \frac{W}{2f} \quad \text{or} \quad f = \frac{W}{2} \left[\tan\frac{\theta}{2}\right]^{-1}. \qquad (2.60)
$$

For conventional film cameras, $W = 35$mm, and hence $f$ is also expressed in millimeters. Since we work with digital images, it is more convenient to express $W$ in pixels so that the focal length $f$ can be used directly in the calibration matrix $K$ as in (2.59).

Another possibility is to scale the pixel coordinates so that they go from $[-1, 1)$ along the longer image dimension and $[-a^{-1}, a^{-1})$ along the shorter axis, where $a \geq 1$ is the image aspect ratio (as opposed to the sensor cell aspect ratio introduced earlier). This can be accomplished using modified normalized device coordinates,

$$
x'_s = (2 x_s - W)/S \quad \text{and} \quad y'_s = (2 y_s - H)/S, \quad \text{where} \quad S = \max(W, H). \qquad (2.61)
$$

This has the advantage that the focal length $f$ and optical center $(c_x, c_y)$ become independent of the image resolution, which can be useful when using multi-resolution, image-processing algorithms, such as image pyramids (Section 3.5).² The use of $S$ instead of $W$ also makes the focal length the same for landscape (horizontal) and portrait (vertical) pictures, as is the case in 35mm photography. (In some computer graphics textbooks and systems, normalized device coordinates go from $[-1, 1] \times [-1, 1]$, which requires the use of two different focal lengths to describe the camera intrinsics (Watt 1995; OpenGL-ARB 1997).) Setting $S = W = 2$ in (2.60), we obtain the simpler (unitless) relationship

$$
f^{-1} = \tan\frac{\theta}{2}. \qquad (2.62)
$$

The conversion between the various focal length representations is straightforward, e.g., to go from a unitless $f$ to one expressed in pixels, multiply by $W/2$, while to convert from an $f$ expressed in pixels to the equivalent 35mm focal length, multiply by $35/W$.
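The conversions described above can be sketched as a few small helpers based on (2.60) and (2.62). The function names are illustrative, and the sketch assumes a landscape image so that $S = W$; the factor of 35 follows the text's convention for the film width.

```python
import math


def focal_from_fov(width, fov_deg):
    """f = (W / 2) / tan(theta / 2), from (2.60); f comes out in the units of width."""
    return 0.5 * width / math.tan(math.radians(fov_deg) / 2)


def fov_from_focal(width, f):
    """Horizontal field of view in degrees, inverting (2.60)."""
    return math.degrees(2 * math.atan(0.5 * width / f))


def unitless_to_pixels(f_unitless, width_px):
    """Unitless focal length (2.62) to pixels: multiply by W / 2."""
    return f_unitless * width_px / 2


def pixels_to_35mm(f_px, width_px):
    """Focal length in pixels to the 35mm-equivalent focal length: multiply by 35 / W."""
    return f_px * 35.0 / width_px


# A 90 degree field of view gives a unitless focal length of (approximately) 1.
f_unitless = focal_from_fov(2.0, 90.0)
f_px = unitless_to_pixels(f_unitless, 640)   # about 320 pixels for a 640-pixel-wide image
f_mm = pixels_to_35mm(f_px, 640)             # about 17.5 mm equivalent
```

Keeping the focal length in unitless form and converting only at the point of use is one way to avoid recomputing intrinsics at every level of an image pyramid.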


Camera matrix 


Now that we have shown how to parameterize the calibration matrix $K$, we can put the camera intrinsics and extrinsics together to obtain a single 3 × 4 camera matrix

$$
P = K [R \mid t]. \qquad (2.63)
$$

It is sometimes preferable to use an invertible 4 × 4 matrix, which can be obtained by not dropping the last row in the $P$ matrix,

$$
\tilde{P} = \begin{bmatrix} K & 0 \\ 0^T & 1 \end{bmatrix} \begin{bmatrix} R & t \\ 0^T & 1 \end{bmatrix} = \tilde{K} E, \qquad (2.64)
$$

where $E$ is a 3D rigid-body (Euclidean) transformation and $\tilde{K}$ is the full-rank calibration matrix. The 4 × 4 camera matrix $\tilde{P}$ can be used to map directly from 3D world coordinates $\bar{p}_w = (x_w, y_w, z_w, 1)$ to screen coordinates (plus disparity), $x_s = (x_s, y_s, 1, d)$,

$$
x_s \sim \tilde{P} \bar{p}_w, \qquad (2.65)
$$

where $\sim$ indicates equality up to scale. Note that after multiplication by $\tilde{P}$, the vector is divided by the third element of the vector to obtain the normalized form $x_s = (x_s, y_s, 1, d)$.
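A small sketch of (2.64) and (2.65) follows, using the simple calibration matrix of (2.59) embedded in a 4 × 4 matrix. The helper names and the numeric values are illustrative assumptions; the point is that the fourth normalized element comes out as the inverse depth $d = 1/z$ in this standard form.

```python
def matmul(A, B):
    """Product of two matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def camera_matrix_4x4(f, cx, cy, R, t):
    """Full-rank 4x4 camera matrix K~ E of (2.64), with K taken from (2.59)."""
    K_tilde = [[f, 0.0, cx, 0.0],
               [0.0, f, cy, 0.0],
               [0.0, 0.0, 1.0, 0.0],
               [0.0, 0.0, 0.0, 1.0]]
    E = [list(R[i]) + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]
    return matmul(K_tilde, E)


def project(P, p_world):
    """Apply (2.65) and divide by the third element to get (x_s, y_s, 1, d)."""
    ph = list(p_world) + [1.0]
    h = [sum(P[i][j] * ph[j] for j in range(4)) for i in range(4)]
    return tuple(c / h[2] for c in h)


identity_R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
P = camera_matrix_4x4(500.0, 320.0, 240.0, identity_R, (0.0, 0.0, 0.0))
x_s = project(P, (1.0, 2.0, 10.0))  # a point 10 units in front of the camera
```

Because the matrix is invertible, the same $\tilde{P}$ can be inverted numerically to map a pixel plus disparity back to a 3D point.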

2 To make the conversion truly accurate after a downsampling step in a pyramid, floating point values of W and H would have to be maintained since they can become non-integral if they are ever odd at a larger resolution in the pyramid.

Figure 2.11 Regular disparity (inverse depth) and projective depth (parallax from a reference plane).


Plane plus parallax (projective depth) 

In general, when using the 4 × 4 matrix $\tilde{P}$, we have the freedom to remap the last row to whatever suits our purpose (rather than just being the “standard” interpretation of disparity as inverse depth). Let us re-write the last row of $\tilde{P}$ as $p_3 = s_3 [\hat{n}_0 \mid c_0]$, where $\|\hat{n}_0\| = 1$. We then have the equation

$$
d = \frac{s_3}{z} (\hat{n}_0 \cdot p_w + c_0), \qquad (2.66)
$$

where $z = p_2 \cdot \bar{p}_w = r_z \cdot (p_w - c)$ is the distance of $p_w$ from the camera center $C$ (2.25) along the optical axis $Z$ (Figure 2.11). Thus, we can interpret $d$ as the projective disparity or projective depth of a 3D scene point $p_w$ from the reference plane $\hat{n}_0 \cdot p_w + c_0 = 0$ (Szeliski and Coughlan 1997; Szeliski and Golland 1999; Shade, Gortler, He et al. 1998; Baker, Szeliski, and Anandan 1998). (The projective depth is also sometimes called parallax in reconstruction algorithms that use the term plane plus parallax (Kumar, Anandan, and Hanna 1994; Sawhney 1994).) Setting $\hat{n}_0 = 0$ and $c_0 = 1$, i.e., putting the reference plane at infinity, results in the more standard $d = 1/z$ version of disparity (Okutomi and Kanade 1993).

Another way to see this is to invert the $\tilde{P}$ matrix so that we can map pixels plus disparity directly back to 3D points,

$$
\tilde{p}_w = \tilde{P}^{-1} x_s. \qquad (2.67)
$$

In general, we can choose $\tilde{P}$ to have whatever form is convenient, i.e., to sample space using an arbitrary projection. This can come in particularly handy when setting up multi-view stereo reconstruction algorithms, since it allows us to sweep a series of planes (Section 11.1.2) through space with a variable (projective) sampling that best matches the sensed image motions (Collins 1996; Szeliski and Golland 1999; Saito and Kanade 1999).

Figure 2.12 A point is projected into two images: (a) relationship between the 3D point coordinate $(X, Y, Z, 1)$ and the 2D projected point $(x, y, 1, d)$; (b) planar homography induced by points all lying on a common plane $\hat{n}_0 \cdot p + c_0 = 0$.

Mapping from one camera to another 

What happens when we take two images of a 3D scene from different camera positions or orientations (Figure 2.12a)? Using the full rank 4 × 4 camera matrix $\tilde{P} = \tilde{K} E$ from (2.64), we can write the projection from world to screen coordinates as

$$
\tilde{x}_0 \sim \tilde{K}_0 E_0 p = \tilde{P}_0 p. \qquad (2.68)
$$

Assuming that we know the z-buffer or disparity value $d_0$ for a pixel in one image, we can compute the 3D point location $p$ using

$$
p \sim E_0^{-1} \tilde{K}_0^{-1} \tilde{x}_0 \qquad (2.69)
$$

and then project it into another image yielding

$$
\tilde{x}_1 \sim \tilde{K}_1 E_1 p = \tilde{K}_1 E_1 E_0^{-1} \tilde{K}_0^{-1} \tilde{x}_0 = \tilde{P}_1 \tilde{P}_0^{-1} \tilde{x}_0 = M_{10} \tilde{x}_0. \qquad (2.70)
$$

Unfortunately, we do not usually have access to the depth coordinates of pixels in a regular photographic image. However, for a planar scene, as discussed above in (2.66), we can replace the last row of $\tilde{P}_0$ in (2.64) with a general plane equation, $\hat{n}_0 \cdot p + c_0$, that maps points on the plane to $d_0 = 0$ values (Figure 2.12b). Thus, if we set $d_0 = 0$, we can ignore the last column of $M_{10}$ in (2.70) and also its last row, since we do not care about the final z-buffer depth. The mapping equation (2.70) thus reduces to

$$
\tilde{x}_1 \sim \tilde{H}_{10} \tilde{x}_0, \qquad (2.71)
$$

where $\tilde{H}_{10}$ is a general 3 × 3 homography matrix and $\tilde{x}_1$ and $\tilde{x}_0$ are now 2D homogeneous coordinates (i.e., 3-vectors) (Szeliski 1996). This justifies the use of the 8-parameter homography as a general alignment model for mosaics of planar scenes (Mann and Picard 1994; Szeliski 1996).

The other special case where we do not need to know depth to perform inter-camera mapping is when the camera is undergoing pure rotation (Section 9.1.3), i.e., when $t_0 = t_1$. In this case, we can write

$$
\tilde{x}_1 \sim K_1 R_1 R_0^{-1} K_0^{-1} \tilde{x}_0 = K_1 R_{10} K_0^{-1} \tilde{x}_0, \qquad (2.72)
$$

which again can be represented with a 3 × 3 homography. If we assume that the calibration matrices have known aspect ratios and centers of projection (2.59), this homography can be parameterized by the rotation amount and the two unknown focal lengths. This particular formulation is commonly used in image-stitching applications (Section 9.1.3).
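The pure-rotation case (2.72) is simple enough to sketch directly. The example below assumes the simple calibration matrix of (2.59) with a shared optical center; the function names, the rotation about the $y$ axis, and its sign convention are illustrative choices rather than anything prescribed by the text.

```python
import math


def matmul3(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def K_simple(f, cx, cy):
    """Calibration matrix of (2.59)."""
    return [[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]]


def K_simple_inv(f, cx, cy):
    """Closed-form inverse of the matrix in (2.59)."""
    return [[1.0 / f, 0.0, -cx / f], [0.0, 1.0 / f, -cy / f], [0.0, 0.0, 1.0]]


def rot_y(theta):
    """Rotation by theta about the y axis (sign convention assumed for this example)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rotation_homography(f0, f1, cx, cy, R10):
    """H = K_1 R_10 K_0^{-1}, as in (2.72)."""
    return matmul3(matmul3(K_simple(f1, cx, cy), R10), K_simple_inv(f0, cx, cy))


def apply_homography(H, x, y):
    """Map pixel (x, y) through H and normalize by the third homogeneous coordinate."""
    h = [H[i][0] * x + H[i][1] * y + H[i][2] for i in range(3)]
    return (h[0] / h[2], h[1] / h[2])


H = rotation_homography(500.0, 500.0, 320.0, 240.0, rot_y(math.atan(0.2)))
x1 = apply_homography(H, 320.0, 240.0)  # the optical center shifts by f * tan(theta) = 100 pixels
```

With the optical center and aspect ratio fixed, the only free parameters are the three rotation parameters and the two focal lengths, which is what makes this form attractive for stitching.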


Object-centered projection 


When working with long focal length lenses, it often becomes difficult to reliably estimate 
the focal length from image measurements alone. This is because the focal length and the 
distance to the object are highly correlated and it becomes difficult to tease these two effects 
apart. For example, the change in scale of an object viewed through a zoom telephoto lens 
can either be due to a zoom change or a motion towards the user. (This effect was put to
dramatic use in Alfred Hitchcock’s film Vertigo, where the simultaneous change of
zoom and camera motion produces a disquieting effect.)

This ambiguity becomes clearer if we write out the projection equation corresponding to the simple calibration matrix $K$ (2.59),

$$
x_s = f \, \frac{r_x \cdot p + t_x}{r_z \cdot p + t_z} + c_x \qquad (2.73)
$$

$$
y_s = f \, \frac{r_y \cdot p + t_y}{r_z \cdot p + t_z} + c_y, \qquad (2.74)
$$

where $r_x$, $r_y$, and $r_z$ are the three rows of $R$. If the distance to the object center $t_z \gg \|p\|$ (the size of the object), the denominator is approximately $t_z$ and the overall scale of the projected object depends on the ratio of $f$ to $t_z$. It therefore becomes difficult to disentangle these two quantities.

To see this more clearly, let $\eta_z = t_z^{-1}$ and $s = \eta_z f$. We can then re-write the above equations as

$$
x_s = s \, \frac{r_x \cdot p + t_x}{1 + \eta_z r_z \cdot p} + c_x \qquad (2.75)
$$

$$
y_s = s \, \frac{r_y \cdot p + t_y}{1 + \eta_z r_z \cdot p} + c_y \qquad (2.76)
$$

(Szeliski and Kang 1994; Pighin, Hecker, Lischinski et al. 1998). The scale of the projection $s$ can be reliably estimated if we are looking at a known object (i.e., the 3D coordinates $p$ are known). The inverse distance $\eta_z$ is now mostly decoupled from the estimates of $s$ and can be estimated from the amount of foreshortening as the object rotates. Furthermore, as the lens becomes longer, i.e., the projection model becomes orthographic, there is no need to replace a perspective imaging model with an orthographic one, since the same equation can be used, with $\eta_z \to 0$ (as opposed to $f$ and $t_z$ both going to infinity). This allows us to form a natural link between orthographic reconstruction techniques such as factorization and their projective/perspective counterparts (Section 7.3).
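The equivalence of the two parameterizations, and the orthographic limit, can be checked numerically. The sketch below implements (2.73)–(2.74) and (2.75)–(2.76) side by side; the function names and the numeric values are illustrative assumptions.

```python
def dot(a, b):
    """Dot product of two 3-vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def project_perspective(p, R, t, f, cx, cy):
    """Standard form (2.73)-(2.74), parameterized by f and t_z."""
    denom = dot(R[2], p) + t[2]
    return (f * (dot(R[0], p) + t[0]) / denom + cx,
            f * (dot(R[1], p) + t[1]) / denom + cy)


def project_object_centered(p, R, tx, ty, s, eta_z, cx, cy):
    """Object-centered form (2.75)-(2.76), parameterized by s = eta_z * f and eta_z = 1 / t_z."""
    denom = 1.0 + eta_z * dot(R[2], p)
    return (s * (dot(R[0], p) + tx) / denom + cx,
            s * (dot(R[1], p) + ty) / denom + cy)


R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = (0.1, -0.2, 50.0)          # object center 50 units from the camera
f, cx, cy = 1000.0, 320.0, 240.0
p = (1.0, 2.0, 3.0)            # a point in object-centered coordinates

eta_z = 1.0 / t[2]
s = eta_z * f
a = project_perspective(p, R, t, f, cx, cy)
b = project_object_centered(p, R, t[0], t[1], s, eta_z, cx, cy)   # same pixel as a
c = project_object_centered(p, R, t[0], t[1], s, 0.0, cx, cy)     # eta_z = 0: scaled orthography
```

Setting `eta_z` to zero while holding `s` fixed removes the dependence on depth entirely, which is the orthographic limit described in the text.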


2.1.6 Lens distortions 


The above imaging models all assume that cameras obey a linear projection model where straight lines in the world result in straight lines in the image. (This follows as a natural consequence of linear matrix operations being applied to homogeneous coordinates.) Unfortunately, many wide-angle lenses have noticeable radial distortion, which manifests itself as a visible curvature in the projection of straight lines. (See Section 2.2.3 for a more detailed discussion of lens optics, including chromatic aberration.) Unless this distortion is taken into account, it becomes impossible to create highly accurate photorealistic reconstructions. For example, image mosaics constructed without taking radial distortion into account will often exhibit blurring due to the mis-registration of corresponding features before pixel blending (Chapter 9).

Fortunately, compensating for radial distortion is not that difficult in practice. For most lenses, a simple quartic model of distortion can produce good results. Let $(x_c, y_c)$ be the pixel coordinates obtained after perspective division but before scaling by focal length $f$ and shifting by the optical center $(c_x, c_y)$, i.e.,

$$
\begin{aligned}
x_c &= \frac{r_x \cdot p + t_x}{r_z \cdot p + t_z} \\
y_c &= \frac{r_y \cdot p + t_y}{r_z \cdot p + t_z}. \qquad (2.77)
\end{aligned}
$$


Figure 2.13 Radial lens distortions: (a) barrel, (b) pincushion, and (c) fisheye. The fisheye image spans almost 180° from side-to-side.

The radial distortion model says that coordinates in the observed images are displaced towards (barrel distortion) or away from (pincushion distortion) the image center by an amount proportional to their radial distance (Figure 2.13a-b).³ The simplest radial distortion models use low-order polynomials, e.g.,

$$
\begin{aligned}
x'_c &= x_c (1 + \kappa_1 r_c^2 + \kappa_2 r_c^4) \\
y'_c &= y_c (1 + \kappa_1 r_c^2 + \kappa_2 r_c^4), \qquad (2.78)
\end{aligned}
$$

where $r_c^2 = x_c^2 + y_c^2$ and $\kappa_1$ and $\kappa_2$ are called the radial distortion parameters.⁴ After the radial distortion step, the final pixel coordinates can be computed using

$$
\begin{aligned}
x_s &= f x'_c + c_x \\
y_s &= f y'_c + c_y. \qquad (2.79)
\end{aligned}
$$

A variety of techniques can be used to estimate the radial distortion parameters for a given lens, as discussed in Section 6.3.5.

3 Anamorphic lenses, which are widely used in feature film production, do not follow this radial distortion model. Instead, they can be thought of, to a first approximation, as inducing different vertical and horizontal scalings, i.e., non-square pixels.
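The polynomial model of (2.78) and the pixel mapping of (2.79) can be sketched in a few lines. The forward functions follow the equations directly; the `undistort` helper uses a simple fixed-point iteration, which is a common numerical approach assumed here for illustration and is not described in the text. The coefficient values are made up for the example.

```python
def distort(xc, yc, k1, k2):
    """Apply (2.78): scale (x_c, y_c) by 1 + k1 r^2 + k2 r^4."""
    r2 = xc * xc + yc * yc
    scale = 1.0 + k1 * r2 + k2 * r2 * r2
    return (xc * scale, yc * scale)


def to_pixels(xd, yd, f, cx, cy):
    """Apply (2.79): scale by the focal length and shift by the optical center."""
    return (f * xd + cx, f * yd + cy)


def undistort(xd, yd, k1, k2, iterations=20):
    """Invert (2.78) by fixed-point iteration (assumed method; suited to mild distortion)."""
    x, y = xd, yd
    for _ in range(iterations):
        r2 = x * x + y * y
        scale = 1.0 + k1 * r2 + k2 * r2 * r2
        x, y = xd / scale, yd / scale
    return (x, y)


k1, k2 = -0.2, 0.05                       # hypothetical coefficients; k1 < 0 shrinks radii
xd, yd = distort(0.3, 0.4, k1, k2)        # distorted normalized coordinates
xs, ys = to_pixels(xd, yd, 500.0, 320.0, 240.0)
x0, y0 = undistort(xd, yd, k1, k2)        # should recover approximately (0.3, 0.4)
```

For strong distortion the fixed-point iteration may converge slowly or not at all, in which case a Newton-style solve on the radius would be a more robust choice.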

Sometimes the above simplified model does not model the true distortions produced by complex lenses accurately enough (especially at very wide angles). A more complete analytic model also includes tangential distortions and decentering distortions (Slama 1980), but these distortions are not covered in this book.

Fisheye lenses (Figure 2.13c) require a model that differs from traditional polynomial models of radial distortion. Fisheye lenses behave, to a first approximation, as equi-distance projectors of angles away from the optical axis (Xiong and Turkowski 1997), which is the same as the polar projection described by Equations (9.22-9.24). Xiong and Turkowski (1997) describe how this model can be extended with the addition of an extra quadratic correction in $\phi$ and how the unknown parameters (center of projection, scaling factor $s$, etc.) can be estimated from a set of overlapping fisheye images using a direct (intensity-based) non-linear minimization algorithm.

For even larger, less regular distortions, a parametric distortion model using splines may be necessary (Goshtasby 1989). If the lens does not have a single center of projection, it may become necessary to model the 3D line (as opposed to direction) corresponding to each pixel separately (Gremban, Thorpe, and Kanade 1988; Champleboux, Lavallee, Sautot et al. 1992; Grossberg and Nayar 2001; Sturm and Ramalingam 2004; Tardif, Sturm, Trudeau et al. 2009). Some of these techniques are described in more detail in Section 6.3.5, which discusses how to calibrate lens distortions.

4 Sometimes the relationship between $x_c$ and $x'_c$ is expressed the other way around, i.e., $x_c = x'_c (1 + \kappa_1 r'^2_c + \kappa_2 r'^4_c)$. This is convenient if we map image pixels into (warped) rays by dividing through by $f$. We can then undistort the rays and have true 3D rays in space.

There is one subtle issue associated with the simple radial distortion model that is often 
glossed over. We have introduced a non-linearity between the perspective projection and final 
sensor array projection steps. Therefore, we cannot, in general, post-multiply an arbitrary 3 x 
3 matrix K with a rotation to put it into upper-triangular form and absorb this into the global 
rotation. However, this situation is not as bad as it may at first appear. For many applications, 
keeping the simplified diagonal form of (2.59) is still an adequate model. Furthermore, if we 
correct radial and other distortions to an accuracy where straight lines are preserved, we have 
essentially converted the sensor back into a linear imager and the previous decomposition still 
applies. 

2.2 Photometric image formation 

In modeling the image formation process, we have described how 3D geometric features in 
the world are projected into 2D features in an image. However, images are not composed of 
2D features. Instead, they are made up of discrete color or intensity values. Where do these 
values come from? How do they relate to the lighting in the environment, surface properties 
and geometry, camera optics, and sensor properties (Figure 2.14)? In this section, we develop 
a set of models to describe these interactions and formulate a generative process of image 
formation. A more detailed treatment of these topics can be found in other textbooks on 
computer graphics and image synthesis (Glassner 1995; Weyrich, Lawrence, Lensch et al. 
2008; Foley, van Dam, Feiner et al. 1995; Watt 1995; Cohen and Wallace 1993; Sillion and 
Puech 1994). 


2.2.1 Lighting 

Images cannot exist without light. To produce an image, the scene must be illuminated with 
one or more light sources. (Certain modalities such as fluorescent microscopy and X-ray 
tomography do not fit this model, but we do not deal with them in this book.) Light sources 
can generally be divided into point and area light sources. 

A point light source originates at a single location in space (e.g., a small light bulb),
potentially at infinity (e.g., the sun). (Note that for some applications such as modeling soft
shadows (penumbras), the sun may have to be treated as an area light source.) In addition to
its location, a point light source has an intensity and a color spectrum, i.e., a distribution over
wavelengths $L(\lambda)$. The intensity of a light source falls off with the square of the distance
between the source and the object being lit, because the same light is being spread over a
larger (spherical) area. A light source may also have a directional falloff (dependence), but
we ignore this in our simplified model.

Figure 2.14 A simplified model of photometric image formation. Light is emitted by one
or more light sources and is then reflected from an object’s surface. A portion of this light is
directed towards the camera. This simplified model ignores multiple reflections, which often
occur in real-world scenes.

Area light sources are more complicated. A simple area light source such as a fluorescent 
ceiling light fixture with a diffuser can be modeled as a finite rectangular area emitting light 
equally in all directions (Cohen and Wallace 1993; Sillion and Puech 1994; Glassner 1995). 
When the distribution is strongly directional, a four-dimensional lightfield can be used instead 
(Ashdown 1993). 

A more complex light distribution that approximates, say, the incident illumination on an
object sitting in an outdoor courtyard, can often be represented using an environment map
(Greene 1986) (originally called a reflection map (Blinn and Newell 1976)). This representation
maps incident light directions $\hat{v}$ to color values (or wavelengths, $\lambda$),

$$L(\hat{v}; \lambda), \qquad (2.80)$$

and is equivalent to assuming that all light sources are at infinity. Environment maps can be 
represented as a collection of cubical faces (Greene 1986), as a single longitude-latitude map 
(Blinn and Newell 1976), or as the image of a reflecting sphere (Watt 1995). A convenient 
way to get a rough model of a real-world environment map is to take an image of a reflective 
mirrored sphere and to unwrap this image onto the desired environment map (Debevec 1998). 
Watt (1995) gives a nice discussion of environment mapping, including the formulas needed 
to map directions to pixels for the three most commonly used representations. 
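
As an illustration of mapping directions to pixels, the sketch below handles the longitude-latitude representation only. The axis convention (y up, longitude measured from the +z axis, row 0 at the top of the image) and the function name are assumptions made for this example and are not taken from Watt (1995); other conventions differ only by rotations and flips.

```python
import math

def direction_to_latlong_pixel(v, width, height):
    """Map a 3D direction v = (x, y, z) to continuous (column, row) coordinates
    in a longitude-latitude environment map of the given size.

    Assumed convention: y is up, longitude 0 points along +z,
    and latitude +90 degrees (straight up) maps to row 0.
    """
    x, y, z = v
    norm = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / norm, y / norm, z / norm          # normalize to a unit vector
    lon = math.atan2(x, z)                          # longitude in (-pi, pi]
    lat = math.asin(max(-1.0, min(1.0, y)))         # latitude in [-pi/2, pi/2]
    col = (lon / (2.0 * math.pi) + 0.5) * width
    row = (0.5 - lat / math.pi) * height
    return col, row
```

For a 512 × 256 map, the direction (0, 0, 1) lands at the image center, while (0, 1, 0) lands on the top row. In practice the returned coordinates would be used to interpolate the stored colors.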
Figure 2.15 (a) Light scatters when it hits a surface. (b) The bidirectional reflectance
distribution function (BRDF) $f(\theta_i, \phi_i, \theta_r, \phi_r)$ is parameterized by the angles that the incident,
$\hat{v}_i$, and reflected, $\hat{v}_r$, light ray directions make with the local surface coordinate frame.


2.2.2 Reflectance and shading 

When light hits an object’s surface, it is scattered and reflected (Figure 2.15a). Many different
models have been developed to describe this interaction. In this section, we first describe the 
most general form, the bidirectional reflectance distribution function, and then look at some 
more specialized models, including the diffuse, specular, and Phong shading models. We also 
discuss how these models can be used to compute the global illumination corresponding to a 
scene. 

The Bidirectional Reflectance Distribution Function (BRDF) 

The most general model of light scattering is the bidirectional reflectance distribution function
(BRDF). 5 Relative to some local coordinate frame on the surface, the BRDF is a four-dimensional
function that describes how much of each wavelength arriving at an incident
direction $\hat{v}_i$ is emitted in a reflected direction $\hat{v}_r$ (Figure 2.15b). The function can be written
in terms of the angles of the incident and reflected directions relative to the surface frame as

$$f_r(\theta_i, \phi_i, \theta_r, \phi_r; \lambda). \qquad (2.81)$$

The BRDF is reciprocal, i.e., because of the physics of light transport, you can interchange
the roles of $\hat{v}_i$ and $\hat{v}_r$ and still get the same answer (this is sometimes called Helmholtz
reciprocity).

5 Actually, even more general models of light transport exist, including some that model spatial variation along
the surface, sub-surface scattering, and atmospheric effects (see Section 12.7.1; Dorsey, Rushmeier, and Sillion
2007; Weyrich, Lawrence, Lensch et al. 2008).

Most surfaces are isotropic, i.e., there are no preferred directions on the surface as far
as light transport is concerned. (The exceptions are anisotropic surfaces such as brushed
(scratched) aluminum, where the reflectance depends on the light orientation relative to the
direction of the scratches.) For an isotropic material, we can simplify the BRDF to

$$f_r(\theta_i, \theta_r, |\phi_r - \phi_i|; \lambda) \quad \text{or} \quad f_r(\hat{v}_i, \hat{v}_r, \hat{n}; \lambda), \qquad (2.82)$$

since the quantities $\theta_i$, $\theta_r$, and $\phi_r - \phi_i$ can be computed from the directions $\hat{v}_i$, $\hat{v}_r$, and $\hat{n}$.

To calculate the amount of light exiting a surface point $p$ in a direction $\hat{v}_r$ under a given
lighting condition, we integrate the product of the incoming light $L_i(\hat{v}_i; \lambda)$ with the BRDF
(some authors call this step a convolution). Taking into account the foreshortening factor
$\cos^+ \theta_i$, we obtain

$$L_r(\hat{v}_r; \lambda) = \int L_i(\hat{v}_i; \lambda) f_r(\hat{v}_i, \hat{v}_r, \hat{n}; \lambda) \cos^+ \theta_i \, d\hat{v}_i, \qquad (2.83)$$

where

$$\cos^+ \theta_i = \max(0, \cos \theta_i). \qquad (2.84)$$

If the light sources are discrete (a finite number of point light sources), we can replace the
integral with a summation,

$$L_r(\hat{v}_r; \lambda) = \sum_i L_i(\lambda) f_r(\hat{v}_i, \hat{v}_r, \hat{n}; \lambda) \cos^+ \theta_i. \qquad (2.85)$$

BRDFs for a given surface can be obtained through physical modeling (Torrance and
Sparrow 1967; Cook and Torrance 1982; Glassner 1995), heuristic modeling (Phong 1975), or
through empirical observation (Ward 1992; Westin, Arvo, and Torrance 1992; Dana, van Ginneken,
Nayar et al. 1999; Dorsey, Rushmeier, and Sillion 2007; Weyrich, Lawrence, Lensch
et al. 2008). 6 Typical BRDFs can often be split into their diffuse and specular components,
as described below.

Diffuse reflection 

The diffuse component (also known as Lambertian or matte reflection) scatters light uni- 
formly in all directions and is the phenomenon we most normally associate with shading, 
e.g., the smooth (non-shiny) variation of intensity with surface normal that is seen when ob- 
serving a statue (Figure 2.16). Diffuse reflection also often imparts a strong body color to 
the light since it is caused by selective absorption and re-emission of light inside the object’s 
material (Shafer 1985; Glassner 1995). 


6 See http://www1.cs.columbia.edu/CAVE/software/curet/ for a database of some empirically sampled BRDFs.

Figure 2.16 This close-up of a statue shows both diffuse (smooth shading) and specular
(shiny highlight) reflection, as well as darkening in the grooves and creases due to reduced
light visibility and interreflections. (Photo courtesy of the Caltech Vision Lab,
http://www.vision.caltech.edu/archive.html.)

While light is scattered uniformly in all directions, i.e., the BRDF is constant,

$$f_d(\hat{v}_i, \hat{v}_r, \hat{n}; \lambda) = f_d(\lambda), \qquad (2.86)$$

the amount of light depends on the angle between the incident light direction and the surface
normal $\theta_i$. This is because the surface area exposed to a given amount of light becomes larger
at oblique angles, becoming completely self-shadowed as the outgoing surface normal points
away from the light (Figure 2.17a). (Think about how you orient yourself towards the sun or
fireplace to get maximum warmth and how a flashlight projected obliquely against a wall is
less bright than one pointing directly at it.) The shading equation for diffuse reflection can
thus be written as

$$L_d(\hat{v}_r; \lambda) = \sum_i L_i(\lambda) f_d(\lambda) \cos^+ \theta_i = \sum_i L_i(\lambda) f_d(\lambda) [\hat{v}_i \cdot \hat{n}]^+, \qquad (2.87)$$

where

$$[\hat{v}_i \cdot \hat{n}]^+ = \max(0, \hat{v}_i \cdot \hat{n}). \qquad (2.88)$$


Specular reflection 

The second major component of a typical BRDF is specular (gloss or highlight) reflection,
which depends strongly on the direction of the outgoing light. Consider light reflecting off a
mirrored surface (Figure 2.17b). Incident light rays are reflected in a direction that is rotated
by 180° around the surface normal $\hat{n}$. Using the same notation as in Equations (2.29-2.30),
we can compute the specular reflection direction $\hat{s}_i$ as

$$\hat{s}_i = v_\parallel - v_\perp = (2 \hat{n} \hat{n}^T - I) v_i. \qquad (2.89)$$

Figure 2.17 (a) The diminution of returned light caused by foreshortening depends on $\hat{v}_i \cdot \hat{n}$,
the cosine of the angle between the incident light direction $\hat{v}_i$ and the surface normal $\hat{n}$. (b)
Mirror (specular) reflection: The incident light ray direction $\hat{v}_i$ is reflected onto the specular
direction $\hat{s}_i$ around the surface normal $\hat{n}$.


The amount of light reflected in a given direction $\hat{v}_r$ thus depends on the angle
$\theta_s = \cos^{-1}(\hat{v}_r \cdot \hat{s}_i)$ between the view direction $\hat{v}_r$ and the specular direction $\hat{s}_i$. For example, the
Phong (1975) model uses a power of the cosine of the angle,

$$f_s(\theta_s; \lambda) = k_s(\lambda) \cos^{k_e} \theta_s, \qquad (2.90)$$

while the Torrance and Sparrow (1967) micro-facet model uses a Gaussian,

$$f_s(\theta_s; \lambda) = k_s(\lambda) \exp(-c_s^2 \theta_s^2). \qquad (2.91)$$

Larger exponents $k_e$ (or inverse Gaussian widths $c_s$) correspond to more specular surfaces
with distinct highlights, while smaller exponents better model materials with softer gloss.
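
The specular direction of Equation (2.89) and the two lobe shapes of Equations (2.90) and (2.91) can be sketched as follows for a single wavelength. The function names are illustrative, the normal is assumed to be a unit vector, and clamping the cosine at zero in the Phong lobe is an added assumption that avoids raising a negative number to a power when the viewer is more than 90° from the specular direction.

```python
import math

def dot(a, b):
    """Dot product of two 3-vectors given as sequences."""
    return sum(x * y for x, y in zip(a, b))

def specular_direction(v_i, n):
    """Mirror direction s_i = (2 n n^T - I) v_i = 2 (n . v_i) n - v_i, for unit n."""
    d = dot(n, v_i)
    return tuple(2.0 * d * nk - vk for nk, vk in zip(n, v_i))

def phong_lobe(theta_s, k_s, k_e):
    """Phong specular term k_s * cos(theta_s)^k_e, with the cosine clamped at 0."""
    return k_s * max(0.0, math.cos(theta_s)) ** k_e

def gaussian_lobe(theta_s, k_s, c_s):
    """Gaussian specular term k_s * exp(-c_s^2 * theta_s^2)."""
    return k_s * math.exp(-(c_s ** 2) * theta_s ** 2)
```

Evaluating `phong_lobe` at a fixed off-specular angle for increasing `k_e` shows the highlight tightening: the value at 0.3 radians is much smaller for `k_e = 50` than for `k_e = 5`.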

Phong shading 

Phong (1975) combined the diffuse and specular components of reflection with another term,
which he called the ambient illumination. This term accounts for the fact that objects are
generally illuminated not only by point light sources but also by a general diffuse illumination
corresponding to inter-reflection (e.g., the walls in a room) or distant sources, such as the
blue sky. In the Phong model, the ambient term does not depend on surface orientation, but
depends on the color of both the ambient illumination $L_a(\lambda)$ and the object $k_a(\lambda)$,

$$f_a(\lambda) = k_a(\lambda) L_a(\lambda). \qquad (2.92)$$

Figure 2.18 Cross-section through a Phong shading model BRDF for a fixed incident illumination
direction: (a) component values as a function of angle away from surface normal;
(b) polar plot. The value of the Phong exponent $k_e$ is indicated by the “Exp” labels and the
light source is at an angle of 30° away from the normal.

Putting all of these terms together, we arrive at the Phong shading model,

$$L_r(\hat{v}_r; \lambda) = k_a(\lambda) L_a(\lambda) + k_d(\lambda) \sum_i L_i(\lambda) [\hat{v}_i \cdot \hat{n}]^+ + k_s(\lambda) \sum_i L_i(\lambda) (\hat{v}_r \cdot \hat{s}_i)^{k_e}. \qquad (2.93)$$

Figure 2.18 shows a typical set of Phong shading model components as a function of the 
angle away from the surface normal (in a plane containing both the lighting direction and the 
viewer). 
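
A minimal single-wavelength sketch of Equation (2.93) is given below. The function name and the representation of lights as (direction, intensity) pairs are assumptions made for illustration; all directions are assumed to be unit vectors pointing away from the surface, and the specular dot product is clamped at zero, which the equation leaves implicit.

```python
def dot(a, b):
    """Dot product of two 3-vectors given as sequences."""
    return sum(x * y for x, y in zip(a, b))

def phong_radiance(n, v_r, lights, k_a, k_d, k_s, k_e, L_a):
    """Evaluate the Phong model at one wavelength.

    n      -- unit surface normal
    v_r    -- unit direction towards the viewer
    lights -- list of (v_i, L_i): unit direction towards the light, and its intensity
    k_a, k_d, k_s -- ambient, diffuse, and specular reflectances
    k_e    -- Phong exponent
    L_a    -- ambient illumination
    """
    total = k_a * L_a                                   # ambient term
    for v_i, L_i in lights:
        n_dot_l = dot(v_i, n)
        diffuse = max(0.0, n_dot_l)                     # [v_i . n]^+
        s_i = tuple(2.0 * n_dot_l * nk - vk             # specular direction (2.89)
                    for nk, vk in zip(n, v_i))
        specular = max(0.0, dot(v_r, s_i)) ** k_e       # clamped (v_r . s_i)^k_e
        total += k_d * L_i * diffuse + k_s * L_i * specular
    return total
```

With the light, viewer, and normal all aligned, the three terms simply add (ambient + diffuse + specular); with the light behind the surface, only the ambient term survives. A color version would call this once per channel with channel-specific coefficients.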

Typically, the ambient and diffuse reflection color distributions $k_a(\lambda)$ and $k_d(\lambda)$ are the
same, since they are both due to sub-surface scattering (body reflection) inside the surface
material (Shafer 1985). The specular reflection distribution $k_s(\lambda)$ is often uniform (white),
since it is caused by interface reflections that do not change the light color. (The exception
to this are metallic materials, such as copper, as opposed to the more common dielectric
materials, such as plastics.)

The ambient illumination $L_a(\lambda)$ often has a different color cast from the direct light
sources $L_i(\lambda)$, e.g., it may be blue for a sunny outdoor scene or yellow for an interior lit
with candles or incandescent lights. (The presence of ambient sky illumination in shadowed
areas is what often causes shadows to appear bluer than the corresponding lit portions of a
scene.) Note also that the diffuse component of the Phong model (or of any shading model)
depends on the angle of the incoming light source $\hat{v}_i$, while the specular component depends
on the relative angle between the viewer $\hat{v}_r$ and the specular reflection direction $\hat{s}_i$ (which
itself depends on the incoming light direction $\hat{v}_i$ and the surface normal $\hat{n}$).

The Phong shading model has been superseded in terms of physical accuracy by a number
of more recently developed models in computer graphics, including the model developed by 
Cook and Torrance (1982) based on the original micro-facet model of Torrance and Sparrow 
(1967). Until recently, most computer graphics hardware implemented the Phong model but 
the recent advent of programmable pixel shaders makes the use of more complex models 
feasible. 

Di-chromatic reflection model 

The Torrance and Sparrow (1967) model of reflection also forms the basis of Shafer’s (1985)
di-chromatic reflection model, which states that the apparent color of a uniform material lit
from a single source depends on the sum of two terms,

$$L_r(\hat{v}_r; \lambda) = L_i(\hat{v}_r, \hat{v}_i, \hat{n}; \lambda) + L_b(\hat{v}_r, \hat{v}_i, \hat{n}; \lambda) \qquad (2.94)$$

$$= c_i(\lambda) m_i(\hat{v}_r, \hat{v}_i, \hat{n}) + c_b(\lambda) m_b(\hat{v}_r, \hat{v}_i, \hat{n}), \qquad (2.95)$$

i.e., the radiance of the light reflected at the interface, $L_i$, and the radiance reflected at the surface
body, $L_b$. Each of these, in turn, is a simple product between a relative power spectrum
$c(\lambda)$, which depends only on wavelength, and a magnitude $m(\hat{v}_r, \hat{v}_i, \hat{n})$, which depends only
on geometry. (This model can easily be derived from a generalized version of Phong’s model
by assuming a single light source and no ambient illumination, and re-arranging terms.) The
di-chromatic model has been successfully used in computer vision to segment specular colored
objects with large variations in shading (Klinker 1993) and more recently has inspired
local two-color models for applications such as Bayer pattern demosaicing (Bennett, Uyttendaele,
Zitnick et al. 2006).

Global illumination (ray tracing and radiosity) 

The simple shading model presented thus far assumes that light rays leave the light sources, 
bounce off surfaces visible to the camera, thereby changing in intensity or color, and arrive 
at the camera. In reality, light sources can be shadowed by occluders and rays can bounce 
multiple times around a scene while making their trip from a light source to the camera. 

Two methods have traditionally been used to model such effects. If the scene is mostly
specular (the classic example being scenes made of glass objects and mirrored or highly polished
balls), the preferred approach is ray tracing or path tracing (Glassner 1995; Akenine-Möller
and Haines 2002; Shirley 2005), which follows individual rays from the camera across
multiple bounces towards the light sources (or vice versa). If the scene is composed mostly
of illuminators and surfaces with uniform albedo and simple geometry, radiosity (global illumination)
techniques are preferred (Cohen and Wallace 1993; Sillion and Puech 1994; Glassner 1995).
Combinations of the two techniques have also been developed (Wallace, Cohen, and Greenberg
1987), as well as more general light transport techniques for simulating effects such as
the caustics cast by rippling water.

The basic ray tracing algorithm associates a light ray with each pixel in the camera im- 
age and finds its intersection with the nearest surface. A primary contribution can then be 
computed using the simple shading equations presented previously (e.g., Equation (2.93))
for all light sources that are visible for that surface element. (An alternative technique for
computing which surfaces are illuminated by a light source is to compute a shadow map,
or shadow buffer, i.e., a rendering of the scene from the light source’s perspective, and then
compare the depth of pixels being rendered with the map (Williams 1983; Akenine-Möller
and Haines 2002).) Additional secondary rays can then be cast along the specular direction 
towards other objects in the scene, keeping track of any attenuation or color change that the 
specular reflection induces. 
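
The first step of the basic algorithm, finding the nearest surface hit by a pixel's ray, can be sketched for a scene made only of spheres. The scene representation (a list of (center, radius) pairs) and the function names are assumptions made for this example; shading, shadow tests, and secondary rays are omitted.

```python
import math

def dot(a, b):
    """Dot product of two 3-vectors given as sequences."""
    return sum(x * y for x, y in zip(a, b))

def intersect_sphere(origin, direction, center, radius):
    """Return the smallest positive t with origin + t * direction on the sphere,
    or None if the ray misses. The direction is assumed to be a unit vector."""
    oc = [o - c for o, c in zip(origin, center)]
    b = 2.0 * dot(oc, direction)
    c = dot(oc, oc) - radius ** 2
    disc = b * b - 4.0 * c                  # discriminant of t^2 + b t + c = 0
    if disc < 0.0:
        return None
    root = math.sqrt(disc)
    for t in ((-b - root) / 2.0, (-b + root) / 2.0):
        if t > 1e-9:                        # ignore hits behind the ray origin
            return t
    return None

def nearest_hit(origin, direction, spheres):
    """Return (t, index) of the closest sphere hit by the ray, or None."""
    best = None
    for index, (center, radius) in enumerate(spheres):
        t = intersect_sphere(origin, direction, center, radius)
        if t is not None and (best is None or t < best[0]):
            best = (t, index)
    return best
```

Given the hit point and its normal, a primary contribution could then be computed with a shading model such as Equation (2.93), and a secondary ray could be cast from the hit point along the specular direction by calling `nearest_hit` again.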

Radiosity works by associating lightness values with rectangular surface areas in the scene 
(including area light sources). The amount of light interchanged between any two (mutually 
visible) areas in the scene can be captured as a form factor, which depends on their relative 
orientation and surface reflectance properties, as well as the $1/r^2$ fall-off as light is distributed
over a larger effective sphere the further away it is (Cohen and Wallace 1993; Sillion and 
Puech 1994; Glassner 1995). A large linear system can then be set up to solve for the final 
lightness of each area patch, using the light sources as the forcing function (right hand side). 
Once the system has been solved, the scene can be rendered from any desired point of view. 
Under certain circumstances, it is possible to recover the global illumination in a scene from 
photographs using computer vision techniques (Yu, Debevec, Malik et al. 1999). 
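
The linear system can be illustrated with the classical radiosity equation $B_i = E_i + \rho_i \sum_j F_{ij} B_j$, where $E_i$ is the emission of patch $i$, $\rho_i$ its reflectance, and $F_{ij}$ a purely geometric form factor. Keeping the reflectance separate from the form factor is a common convention but an assumption here, as are the function name and the use of simple fixed-point (Jacobi) iteration in place of a direct solver.

```python
def solve_radiosity(emission, reflectance, form_factors, iterations=100):
    """Iteratively solve B_i = E_i + rho_i * sum_j F_ij * B_j.

    emission     -- list of E_i (non-zero only for light sources)
    reflectance  -- list of rho_i, each in [0, 1)
    form_factors -- nested list with form_factors[i][j] = F_ij
    """
    n = len(emission)
    B = list(emission)                       # start from the emitted light only
    for _ in range(iterations):
        # Each sweep adds one more bounce of inter-reflected light.
        B = [emission[i] + reflectance[i] *
             sum(form_factors[i][j] * B[j] for j in range(n))
             for i in range(n)]
    return B
```

For two facing patches that see only each other ($F_{12} = F_{21} = 1$), with one emitting unit light and both reflecting half of what they receive, the iteration converges to $B = (4/3, 2/3)$, which can be checked by substituting into the two equations.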

The basic radiosity algorithm does not take into account certain near field effects, such 
as the darkening inside corners and scratches, or the limited ambient illumination caused 
by partial shadowing from other surfaces. Such effects have been exploited in a number of 
computer vision algorithms (Nayar, Ikeuchi, and Kanade 1991; Langer and Zucker 1994). 

While all of these global illumination effects can have a strong effect on the appearance 
of a scene, and hence its 3D interpretation, they are not covered in more detail in this book. 
(But see Section 12.7.1 for a discussion of recovering BRDFs from real scenes and objects.) 


2.2.3 Optics 

Once the light from a scene reaches the camera, it must still pass through the lens before 
reaching the sensor (analog film or digital silicon). For many applications, it suffices to 
treat the lens as an ideal pinhole that simply projects all rays through a common center of 
projection (Figures 2.8 and 2.9). 

However, if we want to deal with issues such as focus, exposure, vignetting, and aberration,
we need to develop a more sophisticated model, which is where the study of optics
comes in (Moller 1988; Hecht 2001; Ray 2002).

Figure 2.19 A thin lens of focal length $f$ focuses the light from a plane a distance $z_o$ in front
of the lens at a distance $z_i$ behind the lens, where $\frac{1}{z_o} + \frac{1}{z_i} = \frac{1}{f}$. If the focal plane (vertical
gray line next to $c$) is moved forward, the images are no longer in focus and the circle of
confusion $c$ (small thick line segments) depends on the distance of the image plane motion
$\Delta z_i$ relative to the lens aperture diameter $d$. The field of view (f.o.v.) depends on the ratio
between the sensor width $W$ and the focal length $f$ (or, more precisely, the focusing distance
$z_i$, which is usually quite close to $f$).

Figure 2.19 shows a diagram of the most basic lens model, i.e., the thin lens composed
of a single piece of glass with very low, equal curvature on both sides. According to the
lens law (which can be derived using simple geometric arguments on light ray refraction), the
relationship between the distance to an object $z_o$ and the distance behind the lens at which a
focused image is formed $z_i$ can be expressed as

$$\frac{1}{z_o} + \frac{1}{z_i} = \frac{1}{f}, \qquad (2.96)$$

where $f$ is called the focal length of the lens. If we let $z_o \to \infty$, i.e., we adjust the lens (move
the image plane) so that objects at infinity are in focus, we get $z_i = f$, which is why we can
think of a lens of focal length $f$ as being equivalent (to a first approximation) to a pinhole a
distance $f$ from the focal plane (Figure 2.10), whose field of view is given by (2.60).
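
A small sketch of the lens law, solving for the image distance given an object distance and focal length, is shown below. The function name and the choice to reject objects at or inside the focal length (where no real focused image forms) are assumptions made for this example.

```python
import math

def image_distance(z_o, f):
    """Solve the thin lens law 1/z_o + 1/z_i = 1/f for z_i.

    z_o and f must be in the same unit; z_o may be math.inf.
    """
    if z_o <= f:
        raise ValueError("object must be farther from the lens than the focal length")
    return 1.0 / (1.0 / f - 1.0 / z_o)
```

For a 50 mm lens, an object 100 mm away is focused 100 mm behind the lens, while `image_distance(math.inf, 50.0)` returns the focal length itself, matching the pinhole approximation described above.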

If the focal plane is moved away from its proper in-focus setting of $z_i$ (e.g., by twisting
the focus ring on the lens), objects at $z_o$ are no longer in focus, as shown by the gray plane in
Figure 2.19. The amount of mis-focus is measured by the circle of confusion $c$ (shown as short
thick blue line segments on the gray plane). 7 The equation for the circle of confusion can be
derived using similar triangles; it depends on the distance of travel in the focal plane $\Delta z_i$
relative to the original focus distance $z_i$ and the diameter of the aperture $d$ (see Exercise 2.4).
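
One way to read the similar-triangles argument is that the cone of light has diameter $d$ at the lens and converges to a point at distance $z_i$, so a plane displaced by $\Delta z_i$ cuts the cone in a disk of diameter $c = d \, |\Delta z_i| / z_i$. The sketch below encodes that relation; it is offered under this assumption and with a hypothetical function name, not as the worked solution to Exercise 2.4.

```python
def circle_of_confusion(d, z_i, delta_z_i):
    """Blur-disk diameter c = d * |delta_z_i| / z_i from similar triangles.

    d         -- aperture diameter
    z_i       -- in-focus image distance behind the lens
    delta_z_i -- displacement of the sensor plane from z_i (either sign)
    All quantities share the same unit.
    """
    return d * abs(delta_z_i) / z_i
```

With a 25 mm aperture focused at 50 mm, moving the sensor by 0.1 mm gives a blur disk of 0.05 mm, and halving the aperture diameter halves the blur.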

7 If the aperture is not completely circular, e.g., if it is caused by a hexagonal diaphragm, it is sometimes possible
to see this effect in the actual blur function (Levin, Fergus, Durand et al. 2007; Joshi, Szeliski, and Kriegman 2008)
or in the “glints” that are seen when shooting into the sun.

Figure 2.20 Regular and zoom lens depth of field indicators.

The allowable depth variation in the scene that limits the circle of confusion to an acceptable
number is commonly called the depth of field and is a function of both the focus distance
and the aperture, as shown diagrammatically by many lens markings (Figure 2.20). Since this
depth of field depends on the aperture diameter $d$, we also have to know how this varies with
the commonly displayed f-number, which is usually denoted as $f/\#$ or $N$ and is defined as

$$f/\# = N = \frac{f}{d}, \qquad (2.97)$$

where the focal length $f$ and the aperture diameter $d$ are measured in the same unit (say,
millimeters).

The usual way to write the f-number is to replace the # in $f/\#$ with the actual number,
i.e., f/1.4, f/2, f/2.8, . . . , f/22. (Alternatively, we can say N = 1.4, etc.) An easy way to
interpret these numbers is to notice that dividing the focal length by the f-number gives us the
diameter $d$, so these are just formulas for the aperture diameter. 8

Notice that the usual progression for f-numbers is in full stops, which are multiples of $\sqrt{2}$,
since this corresponds to doubling the area of the entrance pupil each time a smaller f-number
is selected. (This doubling is also called changing the exposure by one exposure value or EV.
It has the same effect on the amount of light reaching the sensor as doubling the exposure
duration, e.g., from 1/250 to 1/125; see Exercise 2.5.)
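
The relationship between f-numbers, aperture diameters, and full stops can be checked numerically with the short sketch below. The function names are illustrative; the only facts used are $N = f/d$ from Equation (2.97) and that the aperture area scales with $d^2$.

```python
import math

def aperture_diameter(f, N):
    """Aperture diameter d = f / N, in the same unit as the focal length f."""
    return f / N

def full_stops(count, start=1.0):
    """First `count` f-numbers in a full-stop progression (powers of sqrt(2))."""
    return [start * math.sqrt(2.0) ** k for k in range(count)]

def relative_light(N):
    """Light gathered relative to f/1: proportional to aperture area, i.e., 1/N^2."""
    return 1.0 / N ** 2
```

`full_stops(10)` yields values that round to the familiar markings 1, 1.4, 2, 2.8, 4, 5.6, 8, 11, 16, 22, and the ratio `relative_light(N) / relative_light(N * math.sqrt(2))` is 2 for any `N`, i.e., one full stop halves or doubles the light.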

Now that you know how to convert between f-numbers and aperture diameters, you can
construct your own plots for the depth of field as a function of focal length $f$, circle of
confusion $c$, and focus distance $z_o$, as explained in Exercise 2.4, and see how well these match
what you observe on actual lenses, such as those shown in Figure 2.20.

Of course, real lenses are not infinitely thin and therefore suffer from geometric aber- 
rations, unless compound elements are used to correct for them. The classic five Seidel 
aberrations, which arise when using third-order optics, include spherical aberration, coma, 
astigmatism, curvature of field, and distortion (Moller 1988; Hecht 2001; Ray 2002). 



8 This also explains why, with zoom lenses, the f-number varies with the current zoom (focal length) setting.

Figure 2.21 In a lens subject to chromatic aberration, light at different wavelengths (e.g.,
the red and blue arrows) is focused with a different focal length $f'$ and hence a different depth
$z'_i$, resulting in both a geometric (in-plane) displacement and a loss of focus.


Chromatic aberration 

Because the index of refraction for glass varies slightly as a function of wavelength, sim- 
ple lenses suffer from chromatic aberration, which is the tendency for light of different 
colors to focus at slightly different distances (and hence also with slightly different mag- 
nification factors), as shown in Figure 2.21. The wavelength-dependent magnification fac- 
tor, i.e., the transverse chromatic aberration, can be modeled as a per-color radial distortion 
(Section 2.1.6) and, hence, calibrated using the techniques described in Section 6.3.5. The 
wavelength-dependent blur caused by longitudinal chromatic aberration can be calibrated 
using techniques described in Section 10.1.4. Unfortunately, the blur induced by longitudinal
aberration can be harder to undo, as higher frequencies can get strongly attenuated and hence 
hard to recover. 

In order to reduce chromatic and other kinds of aberrations, most photographic lenses 
today are compound lenses made of different glass elements (with different coatings). Such 
lenses can no longer be modeled as having a single nodal point P through which all of the 
rays must pass (when approximating the lens with a pinhole model). Instead, these lenses 
have both a front nodal point, through which the rays enter the lens, and a rear nodal point, 
through which they leave on their way to the sensor. In practice, only the location of the front 
nodal point is of interest when performing careful camera calibration, e.g., when determining 
the point around which to rotate to capture a parallax-free panorama (see Section 9.1.3). 

Not all lenses, however, can be modeled as having a single nodal point. In particular, very
wide-angle lenses such as fisheye lenses (Section 2.1.6) and certain catadioptric imaging
systems consisting of lenses and curved mirrors (Baker and Nayar 1999) do not have a single
point through which all of the acquired light rays pass. In such cases, it is preferable to
explicitly construct a mapping function (look-up table) between pixel coordinates and 3D
rays in space (Gremban, Thorpe, and Kanade 1988; Champleboux, Lavallee, Sautot et al.
1992; Grossberg and Nayar 2001; Sturm and Ramalingam 2004; Tardif, Sturm, Trudeau et
al. 2009), as mentioned in Section 2.1.6.

Vignetting

Another property of real-world lenses is vignetting, which is the tendency for the brightness
of the image to fall off towards the edge of the image.

Figure 2.22 The amount of light hitting a pixel of surface area $\delta i$ depends on the square of
the ratio of the aperture diameter $d$ to the focal length $f$, as well as the fourth power of the
off-axis angle $\alpha$ cosine, $\cos^4 \alpha$.

Two kinds of phenomena usually contribute to this effect (Ray 2002). The first is called
natural vignetting and is due to the foreshortening in the object surface, projected pixel, and
lens aperture, as shown in Figure 2.22. Consider the light leaving the object surface patch
of size $\delta o$ located at an off-axis angle $\alpha$. Because this patch is foreshortened with respect
to the camera lens, the amount of light reaching the lens is reduced by a factor $\cos \alpha$. The
amount of light reaching the lens is also subject to the usual $1/r^2$ fall-off; in this case, the
distance $r_o = z_o / \cos \alpha$. The actual area of the aperture through which the light passes
is foreshortened by an additional factor $\cos \alpha$, i.e., the aperture as seen from point $O$ is an
ellipse of dimensions $d \times d \cos \alpha$. Putting all of these factors together, we see that the amount
of light leaving $O$ and passing through the aperture on its way to the image pixel located at $I$
is proportional to

$$\delta o \, \frac{\cos \alpha}{r_o^2} \, \pi \left( \frac{d}{2} \right)^2 \cos \alpha = \delta o \, \frac{\pi}{4} \frac{d^2}{z_o^2} \cos^4 \alpha. \qquad (2.98)$$

Since triangles $\triangle OPQ$ and $\triangle IPJ$ are similar, the projected areas of the object surface $\delta o$
and image pixel $\delta i$ are in the same (squared) ratio as $z_o : z_i$,

$$\frac{\delta o}{\delta i} = \frac{z_o^2}{z_i^2}. \qquad (2.99)$$

Putting these together, we obtain the final relationship between the amount of light reaching
pixel $i$ and the aperture diameter $d$, the focusing distance $z_i \approx f$, and the off-axis angle $\alpha$,

$$\delta o \, \frac{\pi}{4} \frac{d^2}{z_o^2} \cos^4 \alpha = \delta i \, \frac{\pi}{4} \frac{d^2}{z_i^2} \cos^4 \alpha \approx \delta i \, \frac{\pi}{4} \left( \frac{d}{f} \right)^2 \cos^4 \alpha, \qquad (2.100)$$

which is called the fundamental radiometric relation between the scene radiance $L$ and the
light (irradiance) $E$ reaching the pixel sensor,

$$E = L \, \frac{\pi}{4} \left( \frac{d}{f} \right)^2 \cos^4 \alpha, \qquad (2.101)$$

(Horn 1986; Nalwa 1993; Hecht 2001; Ray 2002). Notice in this equation how the amount of
light depends on the pixel surface area (which is why the smaller sensors in point-and-shoot
cameras are so much noisier than digital single lens reflex (SLR) cameras), the inverse square
of the f-stop $N = f/d$ (2.97), and the fourth power of the $\cos^4 \alpha$ off-axis fall-off, which is
the natural vignetting term.
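
The natural vignetting term is easy to evaluate numerically. The sketch below assumes the standard form of the fundamental radiometric relation, $E = L \, (\pi/4) (d/f)^2 \cos^4 \alpha$; the function name and the example values are illustrative only.

```python
import math

def pixel_irradiance(L, d, f, alpha):
    """Irradiance E at a pixel for scene radiance L, aperture diameter d,
    focal length f (same unit as d), and off-axis angle alpha in radians."""
    return L * (math.pi / 4.0) * (d / f) ** 2 * math.cos(alpha) ** 4
```

Dividing the value at an off-axis angle by the on-axis value isolates the fall-off: at 30° off axis the image is already at about 56% of its central brightness ($\cos^4 30° = 0.5625$), and stopping down from f/2 to f/4 reduces the irradiance by a factor of four at every angle.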

The other major kind of vignetting, called mechanical vignetting, is caused by the internal 
occlusion of rays near the periphery of lens elements in a compound lens, and cannot easily 
be described mathematically without performing a full ray-tracing of the actual lens design. 9 
However, unlike natural vignetting, mechanical vignetting can be decreased by reducing the 
camera aperture (increasing the f-number). It can also be calibrated (along with natural vi- 
gnetting) using special devices such as integrating spheres, uniformly illuminated targets, or 
camera rotation, as discussed in Section 10.1.3. 


After starting from one or more light sources, reflecting off one or more surfaces in the world, 
and passing through the camera’s optics (lenses), light finally reaches the imaging sensor. 
How are the photons arriving at this sensor converted into the digital (R, G, B) values that 
we observe when we look at a digital image? In this section, we develop a simple model 
that accounts for the most important effects such as exposure (gain and shutter speed), non- 
linear mappings, sampling and aliasing, and noise. Figure 2.23, which is based on camera 
models developed by Healey and Kondepudy (1994); Tsin, Ramesh, and Kanade (2001); Liu, 
Szeliski, Kang et al. (2008), shows a simple version of the processing stages that occur in 
modern digital cameras. Chakrabarti, Scharstein, and Zickler (2009) developed a sophisti- 
cated 24-parameter model that is an even better match to the processing performed in today’s 
cameras. 

9 There are some empirical models that work well in practice (Kang and Weiss 2000; Zheng, Lin, and Kang
2006).

Figure 2.23 Image sensing pipeline, showing the various sources of noise as well as typical
digital post-processing steps.


Light falling on an imaging sensor is usually picked up by an active sensing area, integrated
for the duration of the exposure (usually expressed as the shutter speed in a fraction of
a second, e.g., 1/125), and then passed to a set of sense amplifiers. The two main kinds
of sensor used in digital still and video cameras today are charge-coupled device (CCD) and
complementary metal-oxide-semiconductor (CMOS).

In a CCD, photons are accumulated in each active well during the exposure time. Then, 
in a transfer phase, the charges are transferred from well to well in a kind of “bucket brigade” 
until they are deposited at the sense amplifiers, which amplify the signal and pass it to 
an analog-to-digital converter (ADC). 10 Older CCD sensors were prone to blooming, when 
charges from one over-exposed pixel spilled into adjacent ones, but most newer CCDs have 
anti-blooming technology (“troughs” into which the excess charge can spill). 

In CMOS, the photons hitting the sensor directly affect the conductivity (or gain) of a 
photodetector, which can be selectively gated to control exposure duration, and locally 
amplified before being read out using a multiplexing scheme. Traditionally, CCD sensors 
outperformed CMOS in quality-sensitive applications, such as digital SLRs, while CMOS 
was better for low-power applications, but today CMOS is used in most digital cameras. 

The main factors affecting the performance of a digital image sensor are the shutter speed, 
sampling pitch, fill factor, chip size, analog gain, sensor noise, and the resolution (and quality) 
of the analog-to-digital converter. Many of the actual values for these parameters can be read 
from the EXIF tags embedded with digital images, while others can be obtained from the 
camera manufacturers’ specification sheets or from camera review or calibration Web sites. 11 

10 In digital still cameras, a complete frame is captured and then read out sequentially at once. However, if video 
is being captured, a rolling shutter, which exposes and transfers each line separately, is often used. In older video 
cameras, the even fields (lines) were scanned first, followed by the odd fields, in a process that is called interlacing. 

Shutter speed. The shutter speed (exposure time) directly controls the amount of light 
reaching the sensor and, hence, determines if images are under- or over-exposed. (For bright 
scenes, where a large aperture or slow shutter speed is desired to get a shallow depth of field 
or motion blur, neutral density filters are sometimes used by photographers.) For dynamic 
scenes, the shutter speed also determines the amount of motion blur in the resulting picture. 
Usually, a higher shutter speed (less motion blur) makes subsequent analysis easier (see Sec- 
tion 10.3 for techniques to remove such blur). However, when video is being captured for 
display, some motion blur may be desirable to avoid stroboscopic effects. 

Sampling pitch. The sampling pitch is the physical spacing between adjacent sensor cells 
on the imaging chip. A sensor with a smaller sampling pitch has a higher sampling density and 
hence provides a higher resolution (in terms of pixels) for a given active chip area. However, 
a smaller pitch also means that each sensor has a smaller area and cannot accumulate as many 
photons; this makes it not as light sensitive and more prone to noise. 

Fill factor. The fill factor is the active sensing area size as a fraction of the theoretically 
available sensing area (the product of the horizontal and vertical sampling pitches). Higher 
fill factors are usually preferable, as they result in more light capture and less aliasing (see 
Section 2.3.1). However, this must be balanced with the need to place additional electronics 
between the active sense areas. The fill factor of a camera can be determined empirically 
using a photometric camera calibration process (see Section 10.1.4). 

Chip size. Video and point-and-shoot cameras have traditionally used small chip areas 
(1/4-inch to 1/2-inch sensors 12), while digital SLR cameras try to come closer to the traditional size 
of a 35mm film frame. 13 When overall device size is not important, having a larger chip 
size is preferable, since each sensor cell can be more photo-sensitive. (For compact cameras, 
a smaller chip means that all of the optics can be shrunk down proportionately.) However, 
larger chips are more expensive to produce, not only because fewer chips can be packed into 
each wafer, but also because the probability of a chip defect goes up linearly with the chip 
area. 

11 http://www.clarkvision.com/imagedetail/digital.sensor.performance.summary/. 

12 These numbers refer to the “tube diameter” of the old vidicon tubes used in video cameras 
(http://www.dpreview.com/learn/?/Glossary/Camera_System/sensor_sizes_01.htm). The 1/2.5” sensor on the 
Canon SD800 camera actually measures 5.76mm × 4.29mm, i.e., a sixth of the size (on a side) of a 35mm 
full-frame (36mm × 24mm) DSLR sensor. 

13 When a DSLR chip does not fill the 35mm full frame, it results in a multiplier effect on the lens focal length. 
For example, a chip that is only about 0.6 the dimension of a 35mm frame will make a 50mm lens image the same 
angular extent as a 50/0.6 ≈ 50 × 1.6 = 80mm lens, as demonstrated in (2.60). 

Analog gain. Before analog-to-digital conversion, the sensed signal is usually boosted by 
a sense amplifier. In video cameras, the gain on these amplifiers was traditionally controlled 
by automatic gain control (AGC) logic, which would adjust these values to obtain a good 
overall exposure. In newer digital still cameras, the user now has some additional control 
over this gain through the ISO setting, which is typically expressed in ISO standard units 
such as 100, 200, or 400. Since the automated exposure control in most cameras also adjusts 
the aperture and shutter speed, setting the ISO manually removes one degree of freedom from 
the camera’s control, just as manually specifying aperture and shutter speed does. In theory, a 
higher gain allows the camera to perform better under low light conditions (less motion blur 
due to long exposure times when the aperture is already maxed out). In practice, however, 
higher ISO settings usually amplify the sensor noise. 

Sensor noise. Throughout the whole sensing process, noise is added from various sources, 
which may include fixed pattern noise, dark current noise, shot noise, amplifier noise and 
quantization noise (Healey and Kondepudy 1994; Tsin, Ramesh, and Kanade 2001). The 
final amount of noise present in a sampled image depends on all of these quantities, as well 
as the incoming light (controlled by the scene radiance and aperture), the exposure time, and 
the sensor gain. Also, for low light conditions where the noise is due to low photon counts, a 
Poisson model of noise may be more appropriate than a Gaussian model. 

As discussed in more detail in Section 10.1.1, Liu, Szeliski, Kang et al. (2008) use this 
model, along with an empirical database of camera response functions (CRFs) obtained by 
Grossberg and Nayar (2004), to estimate the noise level function (NLF) for a given image, 
which predicts the overall noise variance at a given pixel as a function of its brightness (a 
separate NLF is estimated for each color channel). An alternative approach, when you have 
access to the camera before taking pictures, is to pre-calibrate the NLF by taking repeated 
shots of a scene containing a variety of colors and luminances, such as the Macbeth Color 
Chart shown in Figure 10.3b (McCamy, Marcus, and Davidson 1976). (When estimating 
the variance, be sure to throw away or downweight pixels with large gradients, as small 
shifts between exposures will affect the sensed values at such pixels.) Unfortunately, the pre- 
calibration process may have to be repeated for different exposure times and gain settings 
because of the complex interactions occurring within the sensing system. 
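
The pre-calibration procedure described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the shots are already aligned, each shot is given as a flat list of values for one color channel, and the synthetic sensor in `simulate_shots` (noise variance equal to `gain` times the signal plus a constant read-noise term) is hypothetical. The rejection of high-gradient pixels mentioned in the text is omitted for brevity.

```python
import random
import statistics

def estimate_nlf(shots, n_bins=4):
    """Estimate noise std. dev. as a function of brightness.

    shots: list of T aligned exposures of a static scene, each a flat
           list of pixel values for one color channel.
    Returns a list of (mean_brightness, noise_std) pairs, one per bin.
    """
    n_pixels = len(shots[0])
    means, variances = [], []
    for i in range(n_pixels):
        samples = [shot[i] for shot in shots]
        means.append(statistics.fmean(samples))
        variances.append(statistics.variance(samples))  # temporal variance

    lo, hi = min(means), max(means)
    width = (hi - lo) / n_bins or 1.0
    bins = [[] for _ in range(n_bins)]
    for m, v in zip(means, variances):
        b = min(int((m - lo) / width), n_bins - 1)
        bins[b].append((m, v))

    nlf = []
    for members in bins:
        if members:
            mean_b = statistics.fmean(m for m, _ in members)
            std_b = statistics.fmean(v for _, v in members) ** 0.5
            nlf.append((mean_b, std_b))
    return nlf

def simulate_shots(scene, n_shots, gain=0.5, read_std=1.0, seed=0):
    """Hypothetical sensor model: variance = gain * signal + read_std**2."""
    rng = random.Random(seed)
    return [[rng.gauss(s, (gain * s + read_std ** 2) ** 0.5) for s in scene]
            for _ in range(n_shots)]

scene = [10.0 + 190.0 * i / 255 for i in range(256)]  # brightness ramp
nlf = estimate_nlf(simulate_shots(scene, n_shots=30))
for brightness, sigma in nlf:
    print(f"brightness {brightness:6.1f}  noise std {sigma:5.2f}")
```

With this synthetic sensor the estimated noise grows with brightness, which is the qualitative behavior a noise level function is meant to capture. A separate run would be needed for each color channel and, as noted above, possibly for each exposure and gain setting.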

In practice, most computer vision algorithms, such as image denoising, edge detection, 
and stereo matching, benefit from at least a rudimentary estimate of the noise level. Barring 
the ability to pre-calibrate the camera or to take repeated shots of the same scene, the simplest 
approach is to look for regions of near-constant value and to estimate the noise variance in 
such regions (Liu, Szeliski, Kang et al. 2008). 
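
One crude way to realize the single-image idea is sketched below. It is not the method of Liu, Szeliski, Kang et al. (2008); it simply assumes that more than half of the non-overlapping blocks in the image are nearly constant, so that the median of the per-block variances is dominated by noise rather than by edges or texture. The function name and the block size are illustrative choices.

```python
import random
import statistics

def estimate_noise_std(image, block=8):
    """Rough noise estimate: square root of the median per-block variance.

    image: list of rows (lists of numbers). Blocks that straddle edges or
    texture have inflated variance; the median mostly ignores them as long
    as more than half of the blocks are nearly constant.
    """
    h, w = len(image), len(image[0])
    block_vars = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            vals = [image[y + dy][x + dx]
                    for dy in range(block) for dx in range(block)]
            block_vars.append(statistics.variance(vals))
    return statistics.median(block_vars) ** 0.5

# Synthetic test image: a step edge at x = 30 plus Gaussian noise (std 2).
rng = random.Random(1)
image = [[(50.0 if x < 30 else 150.0) + rng.gauss(0.0, 2.0)
          for x in range(64)] for y in range(64)]
sigma = estimate_noise_std(image)
print(f"estimated noise std: {sigma:.2f}")
```

Because it returns a single number, this sketch ignores the dependence of noise on brightness; binning the blocks by their mean value would be one way to extend it toward a noise level function.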


ADC resolution. The final step in the analog processing chain occurring within an imaging 
sensor is the analog to digital conversion (ADC). While a variety of techniques can be used 
to implement this process, the two quantities of interest are the resolution of this process 
(how many bits it yields) and its noise level (how many of these bits are useful in practice). 
For most cameras, the number of bits quoted (eight bits for compressed JPEG images and a 
nominal 16 bits for the RAW formats provided by some DSLRs) exceeds the actual number 
of usable bits. The best way to tell is to simply calibrate the noise of a given sensor, e.g., 
by taking repeated shots of the same scene and plotting the estimated noise as a function of 
brightness (Exercise 2.6). 


Digital post-processing. Once the irradiance values arriving at the sensor have been con- 
verted to digital bits, most cameras perform a variety of digital signal processing (DSP) 
operations to enhance the image before compressing and storing the pixel values. These in- 
clude color filter array (CFA) demosaicing, white point setting, and mapping of the luminance 
values through a gamma function to increase the perceived dynamic range of the signal. We 
cover these topics in Section 2.3.2 but, before we do, we return to the topic of aliasing, which 
was mentioned in connection with sensor array fill factors. 


2.3.1 Sampling and aliasing 

What happens when a field of light impinging on the image sensor falls onto the active sense 
areas in the imaging chip? The photons arriving at each active cell are integrated and then 
digitized. However, if the fill factor on the chip is small and the signal is not otherwise 
band-limited, visually unpleasing aliasing can occur. 

To explore the phenomenon of aliasing, let us first look at a one-dimensional signal 
(Figure 2.24), in which we have two sine waves, one at a frequency of f = 3/4 and the other at 
f = 5/4. If we sample these two signals at a frequency of f = 2, we see that they produce 
the same samples (shown in black), and so we say that they are aliased. 14 Why is this a bad 
effect? In essence, we can no longer reconstruct the original signal, since we do not know 
which of the two original frequencies was present. 
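
The coincidence of the two sample sets is easy to check numerically. The sketch below assumes zero-phase cosines, cos(2πft), sampled at t = k/2; with a sine phase the two sets of samples would instead differ by a sign, so the exact phase is an assumption about how the figure was drawn.

```python
import math

def sample_cosine(freq, sample_rate, n_samples):
    """Sample cos(2*pi*freq*t) at t = k / sample_rate, k = 0..n_samples-1."""
    return [math.cos(2 * math.pi * freq * k / sample_rate)
            for k in range(n_samples)]

low = sample_cosine(3 / 4, sample_rate=2, n_samples=8)
high = sample_cosine(5 / 4, sample_rate=2, n_samples=8)

# The two columns agree to within floating-point rounding.
for a, b in zip(low, high):
    print(f"{a:+.4f}  {b:+.4f}")
```

The two frequencies are symmetric about f = 1, which is half the sampling frequency, and that symmetry is what makes them indistinguishable after sampling.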

In fact, Shannon’s Sampling Theorem shows that the minimum sampling rate (Oppenheim 
and Schafer 1996; Oppenheim, Schafer, and Buck 1999) required to reconstruct a signal 
from its instantaneous samples must be at least twice the highest frequency, 15 

$$f_s \geq 2 f_{\max}. \qquad (2.102)$$

The maximum frequency in a signal is known as the Nyquist frequency and the inverse of the 
minimum sampling frequency, $r_s = 1/f_s$, is known as the Nyquist rate. 

14 An alias is an alternate name for someone, so the sampled signal corresponds to two different aliases. 

Figure 2.24 Aliasing of a one-dimensional signal: The blue sine wave at f = 3/4 and the 
red sine wave at f = 5/4 have the same digital samples, when sampled at f = 2. Even after 
convolution with a 100% fill factor box filter, the two signals, while no longer of the same 
magnitude, are still aliased in the sense that the sampled red signal looks like an inverted 
lower magnitude version of the blue signal. (The image on the right is scaled up for better 
visibility. The actual sine magnitudes are 30% and −18% of their original values.) 

However, you may ask, since an imaging chip actually averages the light field over a 
finite area, are the results on point sampling still applicable? Averaging over the sensor area 
does tend to attenuate some of the higher frequencies. However, even if the fill factor is 
100%, as in the right image of Figure 2.24, frequencies above the Nyquist limit (half the 
sampling frequency) still produce an aliased signal, although with a smaller magnitude than 
the corresponding band-limited signals. 
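
The attenuation caused by area averaging can be worked out directly: averaging a sinusoid of frequency f over a box of width w scales it by sin(πfw)/(πfw). The sketch below evaluates this factor and checks it against a brute-force average. The unit box width used in the example is an assumption, chosen because it reproduces the 30% and −18% magnitudes quoted in the caption of Figure 2.24; other widths give different factors.

```python
import math

def box_attenuation(freq, width):
    """Gain applied to a sinusoid of frequency `freq` when it is averaged
    over a box (sensor aperture) of the given width: sin(pi f w)/(pi f w)."""
    x = math.pi * freq * width
    return 1.0 if x == 0 else math.sin(x) / x

def box_average_numeric(freq, width, center=0.0, n=10001):
    """Numerical check: mean of cos(2 pi f t) over the box (midpoint rule)."""
    total = 0.0
    for i in range(n):
        t = center - width / 2 + (i + 0.5) * width / n
        total += math.cos(2 * math.pi * freq * t)
    return total / n

for f in (3 / 4, 5 / 4):
    print(f"f = {f:.2f}: closed form {box_attenuation(f, 1.0):+.4f}, "
          f"numeric {box_average_numeric(f, 1.0):+.4f}")
```

The negative factor for f = 5/4 corresponds to the inversion of the red signal noted in the caption: the higher frequency is attenuated more strongly, but it is not removed.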

A more convincing argument as to why aliasing is bad can be seen by downsampling 
a signal using a poor quality filter such as a box (square) filter. Figure 2.25 shows a high- 
frequency chirp image (so called because the frequencies increase over time), along with the 
results of sampling it with a 25% fill-factor area sensor, a 100% fill-factor sensor, and a high- 
quality 9-tap filter. Additional examples of downsampling (decimation) filters can be found 
in Section 3.5.2 and Figure 3.30. 

The best way to predict the amount of aliasing that an imaging system (or even an image 
processing algorithm) will produce is to estimate the point spread function (PSF), which 
represents the response of a particular pixel sensor to an ideal point light source. The PSF 
is a combination (convolution) of the blur induced by the optical system (lens) and the finite 
integration area of a chip sensor. 16 

15 The actual theorem states that $f_s$ must be at least twice the signal bandwidth but, since we are not dealing with 
modulated signals such as radio waves during image capture, the maximum frequency suffices. 

16 Cameras usually interpose an optical anti-aliasing filter just before the imaging chip to reduce or control 
the amount of aliasing. 



Figure 2.25 Aliasing of a two-dimensional signal: (a) original full-resolution image; (b) 
downsampled 4x with a 25% fill factor box filter; (c) downsampled 4x with a 100% fill 
factor box filter; (d) downsampled 4x with a high-quality 9-tap filter. Notice how the higher 
frequencies are aliased into visible frequencies with the lower quality filters, while the 9-tap 
filter completely removes these higher frequencies. 


If we know the blur function of the lens and the fill factor (sensor area shape and spacing) 
for the imaging chip (plus, optionally, the response of the anti-aliasing filter), we can convolve 
these (as described in Section 3.2) to obtain the PSF. Figure 2.26a shows the one-dimensional 
cross-section of a PSF for a lens whose blur function is assumed to be a disc of a radius 
equal to the pixel spacing s plus a sensing chip whose horizontal fill factor is 80%. Taking 
the Fourier transform of this PSF (Section 3.4), we obtain the modulation transfer function 
(MTF), from which we can estimate the amount of aliasing as the area of the Fourier 
magnitude outside the $f \leq f_s$ Nyquist frequency. 17 If we de-focus the lens so that the blur function 
has a radius of 2s (Figure 2.26c), we see that the amount of aliasing decreases significantly, 
but so does the amount of image detail (frequencies closer to $f = f_s$). 
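
The effect of defocus on aliasing can be illustrated with a small calculation. The sketch below makes several simplifying assumptions that are not taken from the text: the one-dimensional cross-section of the lens blur is treated as a box whose width is the disc diameter (2s and 4s for the radii s and 2s quoted above), the sensor aperture is a box of width 0.8s, the Nyquist limit is taken as half the sampling frequency 1/s, and the area is truncated at an arbitrary `f_max`. Since convolving two boxes multiplies their Fourier transforms, the MTF is a product of two sinc functions.

```python
import math

def sinc(x):
    """Normalized sinc, sin(pi x) / (pi x)."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def mtf(freq, blur_width, fill_factor, spacing=1.0):
    """|Fourier transform| of (lens box blur) convolved with (sensor box).

    Convolution in space is multiplication in frequency, and the Fourier
    transform of a unit-area box of width w is sinc(w * f).
    """
    return abs(sinc(blur_width * freq) * sinc(fill_factor * spacing * freq))

def aliasing_area(blur_width, fill_factor, spacing=1.0, f_max=8.0, n=8000):
    """Area under the MTF between the Nyquist limit and f_max (midpoint rule)."""
    nyquist = 0.5 / spacing
    step = (f_max - nyquist) / n
    return step * sum(
        mtf(nyquist + (i + 0.5) * step, blur_width, fill_factor, spacing)
        for i in range(n))

in_focus = aliasing_area(blur_width=2.0, fill_factor=0.8)   # blur radius s
defocused = aliasing_area(blur_width=4.0, fill_factor=0.8)  # blur radius 2s
print(f"aliasing area, radius s:  {in_focus:.3f}")
print(f"aliasing area, radius 2s: {defocused:.3f}")
```

Under these assumptions the wider blur has a smaller area above the Nyquist limit, in line with the qualitative statement in the text, at the cost of a lower response at the frequencies that carry image detail.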

Under laboratory conditions, the PSF can be estimated (to pixel precision) by looking at a 
point light source such as a pin hole in a black piece of cardboard lit from behind. However, 
this PSF (the actual image of the pin hole) is only accurate to a pixel resolution and, while 
it can model larger blur (such as blur caused by defocus), it cannot model the sub-pixel 
shape of the PSF and predict the amount of aliasing. An alternative technique, described in 
Section 10.1.4, is to look at a calibration pattern (e.g., one consisting of slanted step edges 
(Reichenbach, Park, and Narayanswamy 1991; Williams and Burns 2001; Joshi, Szeliski, and 
Kriegman 2008)) whose ideal appearance can be re-synthesized to sub-pixel precision. 

In addition to occurring during image acquisition, aliasing can also be introduced in 
various image processing operations, such as resampling, upsampling, and downsampling. 
Sections 3.4 and 3.5.2 discuss these issues and show how careful selection of filters can reduce 
the amount of aliasing that operations inject. 

17 The complex Fourier transform of the PSF is actually called the optical transfer function (OTF) (Williams 
1999). Its magnitude is called the modulation transfer function (MTF) and its phase is called the phase transfer 
function (PTF). 

Figure 2.26 Sample point spread functions (PSF): The diameter of the blur disc (blue) in 
(a) is equal to half the pixel spacing, while the diameter in (c) is twice the pixel spacing. The 
horizontal fill factor of the sensing chip is 80% and is shown in brown. The convolution of 
these two kernels gives the point spread function, shown in green. The Fourier response of 
the PSF (the MTF) is plotted in (b) and (d). The area above the Nyquist frequency where 
aliasing occurs is shown in red. 


2.3.2 Color 

In Section 2.2, we saw how lighting and surface reflections are functions of wavelength. 
When the incoming light hits the imaging sensor, light from different parts of the spectrum is 
somehow integrated into the discrete red, green, and blue (RGB) color values that we see in 
a digital image. How does this process work and how can we analyze and manipulate color 
values? 

You probably recall from your childhood days the magical process of mixing paint colors 
to obtain new ones. You may recall that blue+yellow makes green, red+blue makes purple, 
and red+green makes brown. If you revisited this topic at a later age, you may have learned 
that the proper subtractive primaries are actually cyan (a light blue-green), magenta (pink), 
and yellow (Figure 2.27b), although black is also often used in four-color printing (CMYK). 
(If you ever subsequently took any painting classes, you learned that colors can have even 
more fanciful names, such as alizarin crimson, cerulean blue, and chartreuse.) The subtractive 
colors are called subtractive because pigments in the paint absorb certain wavelengths in the 
color spectrum. 

Figure 2.27 Primary and secondary colors: (a) additive colors red, green, and blue can be 
mixed to produce cyan, magenta, yellow, and white; (b) subtractive colors cyan, magenta, 
and yellow can be mixed to produce red, green, blue, and black. 

Later on, you may have learned about the additive primary colors (red, green, and blue) 
and how they can be added (with a slide projector or on a computer monitor) to produce cyan, 
magenta, yellow, white, and all the other colors we typically see on our TV sets and monitors 
(Figure 2.27a). 

Through what process is it possible for two different colors, such as red and green, to 
interact to produce a third color like yellow? Are the wavelengths somehow mixed up to 
produce a new wavelength? 

You probably know that the correct answer has nothing to do with physically mixing 
wavelengths. Instead, the existence of three primaries is a result of the tri-stimulus (or tri- 
chromatic) nature of the human visual system, since we have three different kinds of cone, 
each of which responds selectively to a different portion of the color spectrum (Glassner 1995; 
Wyszecki and Stiles 2000; Fairchild 2005; Reinhard, Ward, Pattanaik et al. 2005; Livingstone 
2008). 18 Note that for machine vision applications, such as remote sensing and terrain clas- 
sification, it is preferable to use many more wavelengths. Similarly, surveillance applications 
can often benefit from sensing in the near-infrared (NIR) range. 

CIE RGB and XYZ 

To test and quantify the tri-chromatic theory of perception, we can attempt to reproduce all 
monochromatic (single wavelength) colors as a mixture of three suitably chosen primaries. 
(Pure wavelength light can be obtained using either a prism or specially manufactured color 
filters.) In the 1930s, the Commission Internationale d’Éclairage (CIE) standardized the RGB 
representation by performing such color matching experiments using the primary colors of 
red (700.0nm wavelength), green (546.1nm), and blue (435.8nm). 

18 See also Mark Fairchild’s Web page, http://www.cis.rit.edu/fairchild/WhyIsColor/books_links.html. 

Figure 2.28 Standard CIE color matching functions: (a) $\bar{r}(\lambda)$, $\bar{g}(\lambda)$, $\bar{b}(\lambda)$ color spectra 
obtained from matching pure colors to the R=700.0nm, G=546.1nm, and B=435.8nm 
primaries; (b) $\bar{x}(\lambda)$, $\bar{y}(\lambda)$, $\bar{z}(\lambda)$ color matching functions, which are linear combinations of the 
$(\bar{r}(\lambda), \bar{g}(\lambda), \bar{b}(\lambda))$ spectra. 

Figure 2.28 shows the results of performing these experiments with a standard observer, 
i.e., averaging perceptual results over a large number of subjects. You will notice that for 
certain pure spectra in the blue-green range, a negative amount of red light has to be added, 
i.e., a certain amount of red has to be added to the color being matched in order to get a color 
match. These results also provided a simple explanation for the existence of metamers, which 
are colors with different spectra that are perceptually indistinguishable. Note that two fabrics 
or paint colors that are metamers under one light may no longer be so under different lighting. 


Because of the problem associated with mixing negative light, the CIE also developed a 
new color space called XYZ, which contains all of the pure spectral colors within its positive 
octant. (It also maps the Y axis to the luminance, i.e., perceived relative brightness, and maps 
pure white to a diagonal (equal-valued) vector.) The transformation from RGB to XYZ is 
given by 


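$$\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \frac{1}{0.17697} \begin{bmatrix} 0.49 & 0.31 & 0.20 \\ 0.17697 & 0.81240 & 0.01063 \\ 0.00 & 0.01 & 0.99 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix}. \qquad (2.103)$$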
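
As a quick check on the matrix in (2.103), the sketch below applies it to a CIE RGB triplet, with and without the leading 1/0.17697 factor. The function name and the `official_scale` flag are illustrative, not part of any standard API.

```python
CIE_RGB_TO_XYZ = [
    [0.49,    0.31,    0.20],
    [0.17697, 0.81240, 0.01063],
    [0.00,    0.01,    0.99],
]

def rgb_to_xyz(rgb, official_scale=False):
    """Map a CIE RGB triplet to XYZ using the matrix in (2.103).

    official_scale=True applies the leading 1/0.17697 factor; the default
    omits it, so that RGB = (1, 1, 1) maps to Y = 1.
    """
    scale = 1.0 / 0.17697 if official_scale else 1.0
    return [scale * sum(m * c for m, c in zip(row, rgb))
            for row in CIE_RGB_TO_XYZ]

print(rgb_to_xyz([1.0, 1.0, 1.0]))                        # white, Y close to 1
print(rgb_to_xyz([1.0, 0.0, 0.0], official_scale=True))   # pure red, Y close to 1
```

Every row of the unscaled matrix sums to one, so equal RGB values map to equal XYZ values, which is the "pure white on the diagonal" property mentioned above.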


While the official definition of the CIE XYZ standard has the matrix normalized so that the 
Y value corresponding to pure red is 1, a more commonly used form is to omit the leading 
fraction, so that the second row adds up to one, i.e., the RGB triplet (1, 1, 1) maps to a Y value 
of 1. Linearly blending the $(\bar{r}(\lambda), \bar{g}(\lambda), \bar{b}(\lambda))$ curves in Figure 2.28a according to (2.103), we 
obtain the resulting $(\bar{x}(\lambda), \bar{y}(\lambda), \bar{z}(\lambda))$ curves shown in Figure 2.28b. Notice how all three 
spectra (color matching functions) now have only positive values and how the $\bar{y}(\lambda)$ curve 
matches that of the luminance perceived by humans. 

Figure 2.29 CIE chromaticity diagram, showing colors and their corresponding (x, y) 
values. Pure spectral colors are arranged around the outside of the curve. 

If we divide the XYZ values by the sum of X+Y+Z, we obtain the chromaticity 
coordinates 

$$x = \frac{X}{X+Y+Z}, \quad y = \frac{Y}{X+Y+Z}, \quad z = \frac{Z}{X+Y+Z}, \qquad (2.104)$$

which sum up to 1. The chromaticity coordinates discard the absolute intensity of a given 
color sample and just represent its pure color. If we sweep the monochromatic color λ 
parameter in Figure 2.28b from λ = 380nm to λ = 800nm, we obtain the familiar chromaticity 
diagram shown in Figure 2.29. This figure shows the (x, y) value for every color value per- 
ceivable by most humans. (Of course, the CMYK reproduction process in this book does not 
actually span the whole gamut of perceivable colors.) The outer curved rim represents where 
all of the pure monochromatic color values map in (x, y) space, while the lower straight line, 
which connects the two endpoints, is known as the purple line. 
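
The normalization in (2.104) takes only a few lines of code. The helper names below are illustrative; the only substantive choice is to raise an error for X = Y = Z = 0, where the chromaticity is undefined.

```python
def chromaticity(X, Y, Z):
    """Return the chromaticity coordinates (x, y, z) of (2.104)."""
    total = X + Y + Z
    if total == 0:
        raise ValueError("chromaticity is undefined for X = Y = Z = 0")
    return X / total, Y / total, Z / total

def to_Yxy(X, Y, Z):
    """Luminance plus the two chromaticity coordinates x and y."""
    x, y, _ = chromaticity(X, Y, Z)
    return Y, x, y

print(chromaticity(1.0, 1.0, 1.0))   # equal-energy white, all three near 1/3
print(to_Yxy(0.2, 0.4, 0.6))
```

Scaling (X, Y, Z) by any positive constant leaves (x, y, z) unchanged, which is the sense in which these coordinates discard absolute intensity.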

A convenient representation for color values, when we want to tease apart luminance 
and chromaticity, is therefore Yxy (luminance plus the two most distinctive chrominance 
components). 


L*a*b* color space 

While the XYZ color space has many convenient properties, including the ability to separate 
luminance from chrominance, it does not actually predict how well humans perceive 
differences in color or luminance. 

Because the response of the human visual system is roughly logarithmic (we can perceive 
relative luminance differences of about 1%), the CIE defined a non-linear re-mapping of the 
XYZ space called L*a*b* (also sometimes called CIELAB), where differences in luminance 
or chrominance are more perceptually uniform. 19 
The L* component of lightness is defined as 


$$L^* = 116 f\left(\frac{Y}{Y_n}\right) - 16, \qquad (2.105)$$


where $Y_n$ is the luminance value for nominal white (Fairchild 2005) and 

$$f(t) = \begin{cases} t^{1/3} & t > \delta^3 \\ t/(3\delta^2) + 2\delta/3 & \text{else,} \end{cases} \qquad (2.106)$$

is a finite-slope approximation to the cube root with $\delta = 6/29$. The resulting 0...100 scale 
roughly measures equal amounts of lightness perceptibility. 

In a similar fashion, the a* and b* components are defined as 


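$$a^* = 500\left[f\left(\frac{X}{X_n}\right) - f\left(\frac{Y}{Y_n}\right)\right] \quad \text{and} \quad b^* = 200\left[f\left(\frac{Y}{Y_n}\right) - f\left(\frac{Z}{Z_n}\right)\right], \qquad (2.107)$$

where again, $(X_n, Y_n, Z_n)$ is the measured white point. Figure 2.32i–k show the L*a*b* 
representation for a sample color image. 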
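
A direct transcription of the XYZ to L*a*b* mapping is sketched below. It follows the standard CIE definition, in which 16 is subtracted from 116 f(Y/Y_n) so that L* runs from 0 for black to 100 for the white point. The default white point is an assumption: it is the XYZ value obtained when equal BT.709 RGB values of 1 are pushed through the BT.709 matrix given later in this section.

```python
DELTA = 6 / 29

def f(t):
    """Finite-slope approximation to the cube root, as in (2.106)."""
    if t > DELTA ** 3:
        return t ** (1 / 3)
    return t / (3 * DELTA ** 2) + 2 * DELTA / 3

# Assumed white point: row sums of the BT.709 RGB-to-XYZ matrix.
WHITE = (0.950456, 1.0, 1.088754)

def xyz_to_lab(X, Y, Z, white=WHITE):
    """Convert XYZ to (L*, a*, b*) relative to the given white point."""
    Xn, Yn, Zn = white
    fx, fy, fz = f(X / Xn), f(Y / Yn), f(Z / Zn)
    L = 116 * fy - 16
    a = 500 * (fx - fy)
    b = 200 * (fy - fz)
    return L, a, b

print(xyz_to_lab(*WHITE))                       # white: L* = 100, a* = b* = 0
print(xyz_to_lab(0.18 * WHITE[0], 0.18, 0.18 * WHITE[2]))  # 18% gray
```

An 18% gray maps to an L* of roughly 50, which illustrates how the cube-root-like curve allocates about half of the lightness scale to luminances below one fifth of white.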


Color cameras 

While the preceding discussion tells us how we can uniquely describe the perceived tri- 
stimulus description of any color (spectral distribution), it does not tell us how RGB still 
and video cameras actually work. Do they just measure the amount of light at the nominal 
wavelengths of red (700.0nm), green (546.1nm), and blue (435.8nm)? Do color monitors just 
emit exactly these wavelengths and, if so, how can they emit negative red light to reproduce 
colors in the cyan range? 

In fact, the design of RGB video cameras has historically been based around the 
availability of colored phosphors that go into television sets. When standard-definition color television 
was invented (NTSC), a mapping was defined between the RGB values that would drive the 
three color guns in the cathode ray tube (CRT) and the XYZ values that unambiguously 
define perceived color (this standard was called ITU-R BT.601). With the advent of HDTV and 
newer monitors, a new standard called ITU-R BT.709 was created, which specifies the XYZ 
values of each of the color primaries, 

$$\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} 0.412453 & 0.357580 & 0.180423 \\ 0.212671 & 0.715160 & 0.072169 \\ 0.019334 & 0.119193 & 0.950227 \end{bmatrix} \begin{bmatrix} R_{709} \\ G_{709} \\ B_{709} \end{bmatrix}. \qquad (2.108)$$

19 Another perceptually motivated color space called L*u*v* was developed and standardized simultaneously 
(Fairchild 2005). 

In practice, each color camera integrates light according to the spectral response function 
of its red, green, and blue sensors, 

$$\begin{aligned} R &= \int L(\lambda) S_R(\lambda)\, d\lambda, \\ G &= \int L(\lambda) S_G(\lambda)\, d\lambda, \\ B &= \int L(\lambda) S_B(\lambda)\, d\lambda, \end{aligned} \qquad (2.109)$$

where $L(\lambda)$ is the incoming spectrum of light at a given pixel and $\{S_R(\lambda), S_G(\lambda), S_B(\lambda)\}$ 
are the red, green, and blue spectral sensitivities of the corresponding sensors. 
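
The integrals in (2.109) can be approximated by a sum over wavelength. In the sketch below, the Gaussian sensitivity curves, their centers and widths, and the 380nm to 780nm integration range are all hypothetical placeholders; real sensitivities have to come from the manufacturer or from measurement.

```python
import math

def gaussian(lam, center, width):
    return math.exp(-0.5 * ((lam - center) / width) ** 2)

# Hypothetical sensor sensitivities (wavelengths in nm), for illustration only.
SENSITIVITY = {
    "R": lambda lam: gaussian(lam, 600.0, 40.0),
    "G": lambda lam: gaussian(lam, 540.0, 40.0),
    "B": lambda lam: gaussian(lam, 450.0, 30.0),
}

def integrate_rgb(spectrum, lam_min=380.0, lam_max=780.0, step=1.0):
    """Approximate (2.109) with a midpoint Riemann sum over wavelength."""
    n = int(round((lam_max - lam_min) / step))
    rgb = {}
    for name, sens in SENSITIVITY.items():
        total = 0.0
        for i in range(n):
            lam = lam_min + (i + 0.5) * step
            total += spectrum(lam) * sens(lam)
        rgb[name] = total * step
    return rgb

# A narrow-band orange-red light centered at 600nm.
narrow = integrate_rgb(lambda lam: gaussian(lam, 600.0, 5.0))
print({k: round(v, 3) for k, v in narrow.items()})
```

Even a nearly monochromatic input excites more than one channel, because the sensitivity curves overlap; this is why the camera's RGB values are not simply measurements at three nominal wavelengths.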

Can we tell what spectral sensitivities the cameras actually have? Unless the camera 
manufacturer provides us with this data or we observe the response of the camera to a whole 
spectrum of monochromatic lights, these sensitivities are not specified by a standard such as 
BT.709. Instead, all that matters is that the tri-stimulus values for a given color produce the 
specified RGB values. The manufacturer is free to use sensors with sensitivities that do not 
match the standard XYZ definitions, so long as they can later be converted (through a linear 
transform) to the standard colors. 

Similarly, while TV and computer monitors are supposed to produce RGB values as spec- 
ified by Equation (2.108), there is no reason that they cannot use digital logic to transform the 
incoming RGB values into different signals to drive each of the color channels. Properly cal- 
ibrated monitors make this information available to software applications that perform color 
management, so that colors in real life, on the screen, and on the printer all match as closely 
as possible. 


Color filter arrays 

While early color TV cameras used three vidicons (tubes) to perform their sensing and later 
cameras used three separate RGB sensing chips, most of today’s digital still and video 
cameras use a color filter array (CFA), where alternating sensors are covered by different 
colored filters. 20 

20 A newer chip design by Foveon (http://www.foveon.com) stacks the red, green, and blue sensors beneath each 
other, but it has not yet gained widespread adoption. 

Figure 2.30 Bayer RGB pattern: (a) color filter array layout; (b) interpolated pixel values, 
with unknown (guessed) values shown as lower case. 


The most commonly used pattern in color cameras today is the Bayer pattern (Bayer 
1976), which places green filters over half of the sensors (in a checkerboard pattern), and red 
and blue filters over the remaining ones (Figure 2.30). The reason that there are twice as many 
green filters as red and blue is because the luminance signal is mostly determined by green 
values and the visual system is much more sensitive to high frequency detail in luminance 
than in chrominance (a fact that is exploited in color image compression — see Section 2.3.3). 
The process of interpolating the missing color values so that we have valid RGB values for 
all the pixels is known as demosaicing and is covered in detail in Section 10.3.1. 
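
The layout of a Bayer mosaic is simple to express in code. The sketch below assumes one particular phase of the pattern (green where row plus column is even, red on even rows, blue on odd rows); actual sensors may start the pattern on a different site. It simulates the raw data a sensor would record from a full RGB image and does not attempt demosaicing.

```python
def bayer_color(row, col):
    """Filter color at a sensor site for an assumed layout: green on the
    checkerboard where (row + col) is even, red on even rows, blue on odd."""
    if (row + col) % 2 == 0:
        return "G"
    return "R" if row % 2 == 0 else "B"

def mosaic(rgb_image):
    """Keep only the channel that each site's filter passes.

    rgb_image: list of rows, each a list of (R, G, B) tuples.
    Returns a single-channel image (list of rows of numbers).
    """
    index = {"R": 0, "G": 1, "B": 2}
    return [[pixel[index[bayer_color(r, c)]] for c, pixel in enumerate(row)]
            for r, row in enumerate(rgb_image)]

for r in range(4):
    print(" ".join(bayer_color(r, c) for c in range(4)))
```

In any even-sized window, half of the sites are green and a quarter each are red and blue, so two of the three color values at every pixel have to be interpolated from neighbors.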

Similarly, color LCD monitors typically use alternating stripes of red, green, and blue 
filters placed in front of each liquid crystal active area to simulate the experience of a full color 
display. As before, because the visual system has higher resolution (acuity) in luminance than 
chrominance, it is possible to digitally pre-filter RGB (and monochrome) images to enhance 
the perception of crispness (Betrisey, Blinn, Dresevic et al. 2000; Platt 2000). 

Color balance 

Before encoding the sensed RGB values, most cameras perform some kind of color balancing 
operation in an attempt to move the white point of a given image closer to pure white (equal 
RGB values). If the color system and the illumination are the same (the BT.709 system uses 
the daylight illuminant D65 as its reference white), the change may be minimal. However, 
if the illuminant is strongly colored, such as incandescent indoor lighting (which generally 
results in a yellow or orange hue), the compensation can be quite significant. 

Figure 2.31 Gamma compression: (a) The relationship between the input signal luminance 
Y and the transmitted signal Y' is given by $Y' = Y^{1/\gamma}$. (b) At the receiver, the signal Y' is 
exponentiated by the factor γ, $\hat{Y} = Y'^{\gamma}$. Noise introduced during transmission is squashed in 
the dark regions, which corresponds to the more noise-sensitive region of the visual system. 

A simple way to perform color correction is to multiply each of the RGB values by a 
different factor (i.e., to apply a diagonal matrix transform to the RGB color space). More 
complicated transforms, which are sometimes the result of mapping to XYZ space and back, 
actually perform a color twist, i.e., they use a general 3×3 color transform matrix. 21 
Exercise 2.9 has you explore some of these issues. 
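
The two kinds of correction can be sketched as follows. How a camera actually picks its gains is not specified here; the sketch assumes that the RGB value of a surface known to be white has been observed, and it keeps the green channel fixed, which is a common but arbitrary convention. All function names are illustrative.

```python
def white_balance_gains(observed_white):
    """Per-channel gains that map the observed white to equal RGB values,
    keeping the green channel unchanged."""
    r, g, b = observed_white
    return (g / r, 1.0, g / b)

def apply_diagonal(rgb, gains):
    """Diagonal color correction: scale each channel independently."""
    return tuple(k * c for k, c in zip(gains, rgb))

def apply_matrix(rgb, m):
    """General 3x3 color transform (a 'color twist')."""
    return tuple(sum(m[i][j] * rgb[j] for j in range(3)) for i in range(3))

observed_white = (0.9, 0.7, 0.4)          # e.g., under yellowish indoor light
gains = white_balance_gains(observed_white)
balanced = apply_diagonal(observed_white, gains)
print(gains, balanced)
```

A diagonal correction is the special case of `apply_matrix` in which all off-diagonal entries are zero; a full matrix can additionally mix channels, which is what allows it to rotate hues.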

Gamma 

In the early days of black and white television, the phosphors in the CRT used to display 
the TV signal responded non-linearly to their input voltage. The relationship between the 
voltage and the resulting brightness was characterized by a number called gamma (γ), since 
the formula was roughly 

$$B = V^{\gamma}, \qquad (2.110)$$

with a γ of about 2.2. To compensate for this effect, the electronics in the TV camera would 
pre-map the sensed luminance Y through an inverse gamma, 

$$Y' = Y^{1/\gamma}, \qquad (2.111)$$

with a typical value of 1/γ = 0.45. 

The mapping of the signal through this non-linearity before transmission had a beneficial 
side effect: noise added during transmission (remember, these were analog days!) would be 
reduced (after applying the gamma at the receiver) in the darker regions of the signal where 
it was more visible (Figure 2.31). 22 (Remember that our visual system is roughly sensitive to 
relative differences in luminance.) 
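
The noise-squashing effect is easy to reproduce numerically. The sketch below uses a pure power law with γ = 2.2 and luminances normalized to the range 0 to 1; actual video standards add a small linear segment near black, which is ignored here. The function names and the noise amplitude are illustrative.

```python
GAMMA = 2.2

def encode(Y, gamma=GAMMA):
    """Camera side: Y' = Y ** (1 / gamma), for Y in [0, 1]."""
    return Y ** (1.0 / gamma)

def decode(Y_prime, gamma=GAMMA):
    """Display side: Y = Y' ** gamma."""
    return Y_prime ** gamma

def noise_after_decode(Y, noise=0.01):
    """Error in displayed luminance when `noise` is added to the encoded signal."""
    return decode(encode(Y) + noise) - Y

for Y in (0.05, 0.2, 0.8):
    print(f"Y = {Y:.2f}: error after decoding = {noise_after_decode(Y):.4f}")
```

The same amount of noise in the transmitted signal produces a much smaller luminance error in dark regions than in bright ones, whereas without the gamma mapping the error would be the same 0.01 everywhere.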

- 1 Those of you old enough to remember the early days of color television will naturally think of the hue adjustment 
knob on the television set, which could produce truly bizarre results. 

22 A related technique called companding was the basis of the Dolby noise reduction systems used with audio 
tapes. 
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When color television was invented, it was decided to separately pass the red, green, and 
blue signals through the same gamma non-linearity before combining them for encoding. 
Today, even though we no longer have analog noise in our transmission systems, signals are 
still quantized during compression (see Section 2.3.3), so applying inverse gamma to sensed 
values is still useful. 
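The noise-squashing effect of gamma described above can be checked numerically. The following is a minimal sketch, assuming luminance normalized to [0, 1], additive Gaussian transmission noise, and the value $\gamma = 2.2$ quoted in the text; the function names and the noise level are illustrative choices, not part of any standard.

```python
import numpy as np

GAMMA = 2.2  # display gamma quoted in the text


def gamma_encode(Y, gamma=GAMMA):
    """Camera-side pre-mapping Y' = Y^(1/gamma), as in (2.111)."""
    return np.clip(Y, 0.0, 1.0) ** (1.0 / gamma)


def gamma_decode(Yp, gamma=GAMMA):
    """Display-side mapping B = V^gamma, as in (2.110)."""
    return np.clip(Yp, 0.0, 1.0) ** gamma


def received_noise_std(Y, sigma=0.01, n=100_000, seed=0):
    """Standard deviation of the displayed luminance when Gaussian noise of
    standard deviation sigma (an assumed value) is added to the encoded signal."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=n)
    received = gamma_decode(gamma_encode(Y) + noise)
    return float(np.std(received))


dark = received_noise_std(0.01)    # a dark region of the signal
bright = received_noise_std(0.9)   # a bright region of the signal
print(f"noise std after decoding: dark={dark:.4f}, bright={bright:.4f}")
```

Under these assumptions, the decoded noise in the dark region comes out well below the transmitted noise level, while the noise in the bright region is amplified, which is the trade-off the text describes.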

Unfortunately, for both computer vision and computer graphics, the presence of gamma 
in images is often problematic. For example, the proper simulation of radiometric phenomena 
such as shading (see Section 2.2 and Equation (2.87)) occurs in a linear radiance space. Once 
all of the computations have been performed, the appropriate gamma should be applied before 
display. Unfortunately, many computer graphics systems (such as shading models) operate 
directly on RGB values and display these values directly. (Fortunately, newer color imaging 
standards such as the 16-bit scRGB use a linear space, which makes this less of a problem 
(Glassner 1995).) 

In computer vision, the situation can be even more daunting. The accurate determination
of surface normals, using a technique such as photometric stereo (Section 12.1.1) or even a
simpler operation such as accurate image deblurring, requires that the measurements be in a
linear space of intensities. Therefore, it is imperative when performing detailed quantitative
computations such as these to first undo the gamma and the per-image color re-balancing
in the sensed color values. Chakrabarti, Scharstein, and Zickler (2009) develop a sophisti-
cated 24-parameter model that is a good match to the processing performed by today’s digital
cameras; they also provide a database of color images you can use for your own testing.23

For other vision applications, however, such as feature detection or the matching of sig- 
nals in stereo and motion estimation, this linearization step is often not necessary. In fact, 
determining whether it is necessary to undo gamma can take some careful thinking, e.g., in 
the case of compensating for exposure variations in image stitching (see Exercise 2.7). 

If all of these processing steps sound confusing to model, they are. Exercise 2.10 has you 
try to tease apart some of these phenomena using empirical investigation, i.e., taking pictures 
of color charts and comparing the RAW and JPEG compressed color values.

Other color spaces 

While RGB and XYZ are the primary color spaces used to describe the spectral content (and 
hence tri-stimulus response) of color signals, a variety of other representations have been 
developed both in video and still image coding and in computer graphics. 

The earliest color representation developed for video transmission was the YIQ standard
developed for NTSC video in North America and the closely related YUV standard developed
for PAL in Europe. In both of these cases, it was desired to have a luma channel Y (so called
since it only roughly mimics true luminance) that would be comparable to the regular
black-and-white TV signal, along with two lower frequency chroma channels.

23 http://vision.middlebury.edu/color/.

In both systems, the Y signal (or more appropriately, the Y’ luma signal since it is gamma 
compressed) is obtained from 

$$Y'_{601} = 0.299 R' + 0.587 G' + 0.114 B', \tag{2.112}$$

where R’G’B’ is the triplet of gamma-compressed color components. When using the newer
color definitions for HDTV in BT.709, the formula is

$$Y'_{709} = 0.2125 R' + 0.7154 G' + 0.0721 B'. \tag{2.113}$$

The UV components are derived from scaled versions of $(B' - Y')$ and $(R' - Y')$, namely,

$$U = 0.492111 (B' - Y') \quad \text{and} \quad V = 0.877283 (R' - Y'), \tag{2.114}$$

whereas the IQ components are the UV components rotated through an angle of 33°. In 
composite (NTSC and PAL) video, the chroma signals were then low-pass filtered horizon- 
tally before being modulated and superimposed on top of the Y’ luma signal. Backward 
compatibility was achieved by having older black-and-white TV sets effectively ignore the 
high-frequency chroma signal (because of slow electronics) or, at worst, superimposing it as 
a high-frequency pattern on top of the main signal. 
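The luma and color-difference formulas above translate directly into code. Here is a minimal sketch, assuming gamma-compressed R’G’B’ values stored as floating-point numbers in the last axis of an array; the function names `rgb_to_yuv` and `yuv_to_rgb` are hypothetical, and the inverse is simply obtained by solving (2.112) and (2.114) for R’, G’, and B’.

```python
import numpy as np


def rgb_to_yuv(rgb):
    """Convert gamma-compressed R'G'B' to Y'UV using (2.112) and (2.114)."""
    rgb = np.asarray(rgb, dtype=np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luma, (2.112)
    u = 0.492111 * (b - y)                  # scaled blue difference
    v = 0.877283 * (r - y)                  # scaled red difference
    return np.stack([y, u, v], axis=-1)


def yuv_to_rgb(yuv):
    """Invert rgb_to_yuv by solving the same equations for R', G', B'."""
    yuv = np.asarray(yuv, dtype=np.float64)
    y, u, v = yuv[..., 0], yuv[..., 1], yuv[..., 2]
    b = y + u / 0.492111
    r = y + v / 0.877283
    g = (y - 0.299 * r - 0.114 * b) / 0.587
    return np.stack([r, g, b], axis=-1)


print(rgb_to_yuv([0.5, 0.5, 0.5]))  # a neutral gray has (near) zero chroma
```

Because the three luma weights sum to one, any neutral (gray) pixel maps to zero U and V, which is what lets a black-and-white receiver use the Y’ channel on its own.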

While these conversions were important in the early days of computer vision, when frame 
grabbers would directly digitize the composite TV signal, today all digital video and still 
image compression standards are based on the newer YCbCr conversion. YCbCr is closely 
related to YUV (the Cb and Cr signals carry the blue and red color difference signals and have
more useful mnemonics than UV) but uses different scale factors to fit within the eight-bit 
range available with digital signals. 

For video, the Y’ signal is re-scaled to fit within the [16 . . . 235] range of values, while 
the Cb and Cr signals are scaled to fit within [16 . . . 240] (Gomes and Velho 1997; Fairchild 
2005). For still images, the JPEG standard uses the full eight-bit range with no reserved 
values. 
where the R’G’B’ values are the eight-bit gamma-compressed color components (i.e., the
actual RGB values we obtain when we open up or display a JPEG image). For most appli-
cations, this formula is not that important, since your image reading software will directly
provide you with the eight-bit gamma-compressed R’G’B’ values. However, if you are trying
to do careful image deblocking (Exercise 3.30), this information may be useful.
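For readers who do need the full-range conversion, the sketch below uses the coefficients of the widely used JFIF convention for JPEG files. This is an assumption on our part: the coefficients are quoted from that convention rather than from this section, and the function names are illustrative.

```python
import numpy as np

# Assumed JFIF full-range coefficients; the first row is the 601 luma (2.112).
_M = np.array([[ 0.299,     0.587,     0.114   ],
               [-0.168736, -0.331264,  0.5     ],
               [ 0.5,      -0.418688, -0.081312]])
_OFFSET = np.array([0.0, 128.0, 128.0])  # centers Cb and Cr in the 8-bit range


def rgb_to_ycbcr_jpeg(rgb8):
    """Map eight-bit R'G'B' values (last axis) to full-range Y'CbCr."""
    rgb = np.asarray(rgb8, dtype=np.float64)
    return rgb @ _M.T + _OFFSET


def ycbcr_to_rgb_jpeg(ycc):
    """Inverse mapping, obtained by inverting the 3x3 matrix numerically."""
    ycc = np.asarray(ycc, dtype=np.float64)
    return (ycc - _OFFSET) @ np.linalg.inv(_M).T


print(rgb_to_ycbcr_jpeg([255, 255, 255]))  # white: maximum luma, neutral chroma
```

Note that the two chroma rows each sum to zero, so gray pixels land on the neutral value 128 in both Cb and Cr.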

Another color space you may come across is hue, saturation, value (HSV), which is a pro-
jection of the RGB color cube onto a non-linear chroma angle, a radial saturation percentage,
and a luminance-inspired value. In more detail, value is defined as either the mean or maxi-
mum color value, saturation is defined as scaled distance from the diagonal, and hue is defined
as the direction around a color wheel (the exact formulas are described by Hall (1989); Foley,
van Dam, Feiner et al. (1995)). Such a decomposition is quite natural in graphics applications
such as color picking (it approximates the Munsell chart for color description). Figure 2.32l-n
shows an HSV representation of a sample color image, where saturation is encoded using a
gray scale (saturated = darker) and hue is depicted as a color.
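If you just want to experiment with this decomposition, Python's standard `colorsys` module provides one concrete HSV definition. The short sketch below assumes RGB components in [0, 1]; note that `colorsys` takes value to be the maximum color component, which is only one of the two options mentioned above, and the helper name `describe` is our own.

```python
import colorsys


def describe(r, g, b):
    """Return (hue in degrees, saturation, value) for RGB components in [0, 1]."""
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    return round(h * 360.0, 1), round(s, 3), round(v, 3)


for name, rgb in [("red", (1.0, 0.0, 0.0)),
                  ("blue", (0.0, 0.0, 1.0)),
                  ("gray", (0.5, 0.5, 0.5))]:
    print(name, describe(*rgb))
```

Pure colors sit at full saturation at different hue angles, while grays (points on the diagonal of the color cube) have zero saturation and an arbitrary hue.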

If you want your computer vision algorithm to only affect the value (luminance) of an
image and not its saturation or hue, a simpler solution is to use either the Yxy (luminance +
chromaticity) coordinates defined in (2.104) or the even simpler color ratios,

$$r = \frac{R}{R + G + B}, \quad g = \frac{G}{R + G + B}, \quad b = \frac{B}{R + G + B} \tag{2.116}$$

(Figure 2.32e-h). After manipulating the luma (2.112), e.g., through the process of histogram
equalization (Section 3.1.4), you can multiply each color ratio by the ratio of the new to old
luma to obtain an adjusted RGB triplet.
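One way to read this recipe is sketched below. It assumes the 601 luma weights of (2.112) and implements the adjustment by scaling each RGB triplet by the ratio of new to old luma, which leaves the color ratios (2.116) unchanged; the function names and the small `eps` guard against division by zero are our own additions.

```python
import numpy as np

LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114])  # 601 luma weights, (2.112)


def color_ratios(rgb, eps=1e-12):
    """Color ratios r, g, b of (2.116) for RGB values in the last axis."""
    rgb = np.asarray(rgb, dtype=np.float64)
    total = rgb.sum(axis=-1, keepdims=True)
    return rgb / np.maximum(total, eps)


def rescale_luma(rgb, new_luma, eps=1e-12):
    """Scale each RGB triplet by new_luma / old_luma, preserving its ratios."""
    rgb = np.asarray(rgb, dtype=np.float64)
    old_luma = rgb @ LUMA_WEIGHTS
    scale = np.asarray(new_luma, dtype=np.float64) / np.maximum(old_luma, eps)
    return rgb * np.asarray(scale)[..., np.newaxis]


pixel = np.array([0.2, 0.4, 0.1])
brighter = rescale_luma(pixel, 2.0 * (pixel @ LUMA_WEIGHTS))
print(color_ratios(pixel), color_ratios(brighter))  # identical ratios
```

In practice, `new_luma` would come from whatever tonal adjustment you apply to the luma image (e.g., histogram equalization), and the result may need clipping to the valid range.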

While all of these color systems may sound confusing, in the end, it often may not mat- 
ter that much which one you use. Poynton, in his Color FAQ, http://www.poynton.com/ 
ColorFAQ.html, notes that the perceptually motivated L*a*b* system is qualitatively similar 
to the gamma-compressed R’G’B’ system we mostly deal with, since both have a fractional 
power scaling (which approximates a logarithmic response) between the actual intensity val- 
ues and the numbers being manipulated. As in all cases, think carefully about what you are 
trying to accomplish before deciding on a technique to use.24


2.3.3 Compression 

The last stage in a camera’s processing pipeline is usually some form of image compression 
(unless you are using a lossless compression scheme such as camera RAW or PNG). 

24 If you are at a loss for questions at a conference, you can always ask why the speaker did not use a perceptual
color space, such as L*a*b*. Conversely, if they did use L*a*b*, you can ask if they have any concrete evidence that
this works better than regular colors.

Figure 2.32 Color space transformations: (a-d) RGB; (e-h) rgb; (i-k) L*a*b*; (l-n) HSV.
Note that the rgb, L*a*b*, and HSV values are all re-scaled to fit the dynamic range of the
printed page.

All color video and image compression algorithms start by converting the signal into
YCbCr (or some closely related variant), so that they can compress the luminance signal with
higher fidelity than the chrominance signal. (Recall that the human visual system has poorer
frequency response to color than to luminance changes.) In video, it is common to subsam-
ple Cb and Cr by a factor of two horizontally; with still images (JPEG), the subsampling
(averaging) occurs both horizontally and vertically.
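The two subsampling patterns just described can be sketched with simple block averaging. The code below is a minimal illustration, assuming a single chroma plane stored as a 2D array; the function names borrow the common 4:2:2 and 4:2:0 labels for these patterns, and real codecs may use different filters and sample positions.

```python
import numpy as np


def subsample_422(chroma):
    """Average horizontal pairs of samples (factor of two horizontally)."""
    c = np.asarray(chroma, dtype=np.float64)
    h, w = c.shape
    c = c[:, : w - w % 2]                  # drop an odd trailing column
    return c.reshape(h, w // 2, 2).mean(axis=2)


def subsample_420(chroma):
    """Average 2x2 blocks (factor of two horizontally and vertically)."""
    c = np.asarray(chroma, dtype=np.float64)
    h, w = c.shape
    c = c[: h - h % 2, : w - w % 2]        # drop odd trailing row/column
    return c.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


cb = np.arange(16, dtype=np.float64).reshape(4, 4)
print(subsample_422(cb).shape, subsample_420(cb).shape)
```

Averaging both ways keeps only a quarter of the chroma samples, which is why the chrominance signal costs so much less to store than the luminance.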

Once the luminance and chrominance images have been appropriately subsampled and
separated into individual images, they are then passed to a block transform stage. The most
common technique used here is the discrete cosine transform (DCT), which is a real-valued
variant of the discrete Fourier transform (DFT) (see Section 3.4.3). The DCT is a reasonable
approximation to the Karhunen-Loève or eigenvalue decomposition of natural image patches,
i.e., the decomposition that simultaneously packs the most energy into the first coefficients
and diagonalizes the joint covariance matrix among the pixels (makes transform coefficients
statistically independent). Both MPEG and JPEG use 8×8 DCT transforms (Wallace 1991;
Le Gall 1991), although newer variants use smaller 4×4 blocks or alternative transformations,
such as wavelets (Taubman and Marcellin 2002) and lapped transforms (Malvar 1990, 1998,
2000).

Figure 2.33 Image compressed with JPEG at three quality settings. Note how the amount
of block artifact and high-frequency aliasing (“mosquito noise”) increases from left to right.

After transform coding, the coefficient values are quantized into a set of small integer 
values that can be coded using a variable bit length scheme such as a Huffman code or an 
arithmetic code (Wallace 1991). (The DC (lowest frequency) coefficients are also adaptively 
predicted from the previous block’s DC values. The term “DC” comes from “direct current”, 
i.e., the non-sinusoidal or non-alternating part of a signal.) The step size in the quantization 
is the main variable controlled by the quality setting on the JPEG file (Figure 2.33). 
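To make the transform and quantization steps concrete, here is a minimal sketch of an orthonormal 8×8 DCT followed by uniform quantization. It assumes a single scalar step size for all coefficients, which is a simplification (JPEG uses a table of per-frequency step sizes), and it omits the entropy coding and DC prediction entirely; all function names are illustrative.

```python
import numpy as np


def dct_matrix(n=8):
    """Orthonormal DCT-II basis: row k holds the k-th cosine basis function."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    C[0, :] = np.sqrt(1.0 / n)   # the DC row has a different normalization
    return C


def dct2(block):
    """Separable 2D DCT of a square block."""
    C = dct_matrix(block.shape[0])
    return C @ block @ C.T


def idct2(coeffs):
    """Inverse 2D DCT (the basis is orthonormal, so the inverse is the transpose)."""
    C = dct_matrix(coeffs.shape[0])
    return C.T @ coeffs @ C


def quantize(coeffs, step):
    """Map coefficients to small integers; larger steps mean lower quality."""
    return np.round(coeffs / step).astype(np.int64)


def dequantize(q, step):
    """Reconstruct approximate coefficients from the integer codes."""
    return q.astype(np.float64) * step


block = np.random.default_rng(0).uniform(0, 255, (8, 8))
codes = quantize(dct2(block), step=16.0)
approx = idct2(dequantize(codes, step=16.0))
print("max pixel error:", np.abs(approx - block).max())
```

Increasing `step` drives more of the integer codes to zero, which is what makes the subsequent variable-length coding effective, at the cost of the block artifacts visible in heavily compressed images.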

With video, it is also usual to perform block-based motion compensation, i.e., to encode
the difference between each block and a predicted set of pixel values obtained from a shifted
block in the previous frame. (The exception is the motion-JPEG scheme used in older DV
camcorders, which is nothing more than a series of individually JPEG compressed image
frames.) While basic MPEG uses 16 × 16 motion compensation blocks with integer motion
values (Le Gall 1991), newer standards use adaptively sized blocks, sub-pixel motions, and
the ability to reference blocks from older frames. In order to recover more gracefully from
failures and to allow for random access to the video stream, predicted P frames are interleaved
among independently coded I frames. (Bi-directional B frames are also sometimes used.)
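The basic idea of block-based motion compensation can be sketched with an exhaustive integer search. The code below assumes grayscale frames stored as 2D arrays, a sum-of-squared-differences matching cost, and a small search window; the function name, the cost, and the window size are illustrative assumptions, and real encoders use much faster search strategies.

```python
import numpy as np


def best_motion_vector(prev, cur, top, left, block=16, search=4):
    """Find the integer shift (dy, dx) of the block in `prev` that best predicts
    the block of `cur` at (top, left); also return the prediction residual."""
    target = cur[top:top + block, left:left + block].astype(np.float64)
    best, best_err = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + block > prev.shape[0] or x + block > prev.shape[1]:
                continue  # candidate block would fall outside the frame
            cand = prev[y:y + block, x:x + block].astype(np.float64)
            err = np.sum((target - cand) ** 2)
            if err < best_err:
                best, best_err = (dy, dx), err
    dy, dx = best
    prediction = prev[top + dy:top + dy + block, left + dx:left + dx + block]
    return best, target - prediction


rng = np.random.default_rng(1)
prev = rng.uniform(0, 255, (64, 64))
cur = np.roll(prev, shift=(2, -3), axis=(0, 1))  # simulate a global shift
mv, residual = best_motion_vector(prev, cur, top=24, left=24)
print(mv, np.abs(residual).max())
```

Only the motion vector and the (ideally small) residual need to be coded for a predicted block, which is where most of the savings in video compression come from.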

The quality of a compression algorithm is usually reported using its peak signal-to-noise
ratio (PSNR), which is derived from the average mean square error,

$$MSE = \frac{1}{n} \sum_{x} \left[ I(x) - \hat{I}(x) \right]^2, \tag{2.117}$$

where $I(x)$ is the original uncompressed image and $\hat{I}(x)$ is its compressed counterpart, or
equivalently, the root mean square error (RMS error), which is defined as

$$RMS = \sqrt{MSE}. \tag{2.118}$$

The PSNR is defined as

$$PSNR = 10 \log_{10} \frac{I_{max}^2}{MSE} = 20 \log_{10} \frac{I_{max}}{RMS}, \tag{2.119}$$

where $I_{max}$ is the maximum signal extent, e.g., 255 for eight-bit images.
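These error measures are straightforward to compute. The sketch below assumes two images of the same shape and eight-bit data by default; the function names are our own, and returning infinity for identical images is a convention we chose rather than part of the definition.

```python
import numpy as np


def mse(original, compressed):
    """Mean square error between two images of the same shape, as in (2.117)."""
    a = np.asarray(original, dtype=np.float64)
    b = np.asarray(compressed, dtype=np.float64)
    return float(np.mean((a - b) ** 2))


def psnr(original, compressed, i_max=255.0):
    """Peak signal-to-noise ratio in decibels, as in (2.119)."""
    err = mse(original, compressed)
    if err == 0.0:
        return float("inf")  # identical images: no noise at all
    return float(10.0 * np.log10(i_max ** 2 / err))


original = np.zeros((4, 4))
compressed = original + 5.0   # every pixel is off by 5 gray levels
print(mse(original, compressed), psnr(original, compressed))
```

Converting to floating point before subtracting matters in practice: subtracting two unsigned eight-bit arrays directly would wrap around instead of producing negative differences.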

While this is just a high-level sketch of how image compression works, it is useful to 
understand so that the artifacts introduced by such techniques can be compensated for in 
various computer vision applications. 


2.4 Additional reading 

As we mentioned at its beginning, this chapter provides only a brief summary of a very
rich and deep set of topics, traditionally covered in a number of separate fields.

A more thorough introduction to the geometry of points, lines, planes, and projections 
can be found in textbooks on multi-view geometry (Hartley and Zisserman 2004; Faugeras 
and Luong 2001) and computer graphics (Foley, van Dam, Feiner et al. 1995; Watt 1995; 
OpenGL-ARB 1997). Topics covered in more depth include higher-order primitives such as 
quadrics, conics, and cubics, as well as three-view and multi-view geometry. 

The image formation (synthesis) process is traditionally taught as part of a computer 
graphics curriculum (Foley, van Dam, Feiner et al. 1995; Glassner 1995; Watt 1995; Shirley 
2005) but it is also studied in physics-based computer vision (Wolff, Shafer, and Healey 
1992a). 

The behavior of camera lens systems is studied in optics (Möller 1988; Hecht 2001; Ray
2002).

Some good books on color theory have been written by Healey and Shafer (1992); Wyszecki
and Stiles (2000); Fairchild (2005), with Livingstone (2008) providing a more fun and infor-
mal introduction to the topic of color perception. Mark Fairchild’s page of color books and
links25 lists many other sources.

Topics relating to sampling and aliasing are covered in textbooks on signal and image
processing (Crane 1997; Jähne 1997; Oppenheim and Schafer 1996; Oppenheim, Schafer,
and Buck 1999; Pratt 2007; Russ 2007; Burger and Burge 2008; Gonzales and Woods 2008).


2.5 Exercises 

A note to students: This chapter is relatively light on exercises since it contains mostly
background material and not that many usable techniques. If you really want to understand
multi-view geometry in a thorough way, I encourage you to read and do the exercises provided
by Hartley and Zisserman (2004). Similarly, if you want some exercises related to the image
formation process, Glassner’s (1995) book is full of challenging problems.

25 http://www.cis.rit.edu/fairchild/WhyIsColor/books_links.html.

Ex 2.1: Least squares intersection point and line fitting — advanced Equation (2.4) shows 
how the intersection of two 2D lines can be expressed as their cross product, assuming the 
lines are expressed as homogeneous coordinates. 

1. If you are given more than two lines and want to find a point $x$ that minimizes the sum
of squared distances to each line,

$$D = \sum_i (x \cdot l_i)^2, \tag{2.120}$$

how can you compute this quantity? (Hint: Write the dot product as $x^T l_i$ and turn the
squared quantity into a quadratic form, $x^T A x$.)

2. To fit a line to a bunch of points, you can compute the centroid (mean) of the points
as well as the covariance matrix of the points around this mean. Show that the line
passing through the centroid along the major axis of the covariance ellipsoid (largest
eigenvector) minimizes the sum of squared distances to the points.

3. These two approaches are fundamentally different, even though projective duality tells
us that points and lines are interchangeable. Why are these two algorithms so appar-
ently different? Are they actually minimizing different objectives?

Ex 2.2: 2D transform editor Write a program that lets you interactively create a set of
rectangles and then modify their “pose” (2D transform). You should implement the following
steps:

1. Open an empty window (“canvas”).

2. Shift drag (rubber-band) to create a new rectangle.

3. Select the deformation mode (motion model): translation, rigid, similarity, affine, or
perspective.

4. Drag any corner of the outline to change its transformation.

This exercise should be built on a set of pixel coordinate and transformation classes, either
implemented by yourself or from a software library. Persistence of the created representation
(save and load) should also be supported (for each rectangle, save its transformation).

Ex 2.3: 3D viewer Write a simple viewer for 3D points, lines, and polygons. Import a set
of point and line commands (primitives) as well as a viewing transform. Interactively modify 
the object or camera transform. This viewer can be an extension of the one you created in 
(Exercise 2.2). Simply replace the viewing transformations with their 3D equivalents. 

(Optional) Add a z-buffer to do hidden surface removal for polygons. 

(Optional) Use a 3D drawing package and just write the viewer control. 

Ex 2.4: Focus distance and depth of field Figure out how the focus distance and depth of 
field indicators on a lens are determined. 

1. Compute and plot the focus distance $z_o$ as a function of the distance traveled from the
focal length $\Delta z_i = f - z_i$ for a lens of focal length $f$ (say, 100mm). Does this explain
the hyperbolic progression of focus distances you see on a typical lens (Figure 2.20)?

2. Compute the depth of field (minimum and maximum focus distances) for a given focus
setting $z_o$ as a function of the circle of confusion diameter $c$ (make it a fraction of
the sensor width), the focal length $f$, and the f-stop number $N$ (which relates to the
aperture diameter $d$). Does this explain the usual depth of field markings on a lens that
bracket the in-focus marker, as in Figure 2.20a?

3. Now consider a zoom lens with a varying focal length $f$. Assume that as you zoom,
the lens stays in focus, i.e., the distance from the rear nodal point to the sensor plane
$z_i$ adjusts itself automatically for a fixed focus distance $z_o$. How do the depth of field
indicators vary as a function of focal length? Can you reproduce a two-dimensional
plot that mimics the curved depth of field lines seen on the lens in Figure 2.20b?

Ex 2.5: F-numbers and shutter speeds List the common f-numbers and shutter speeds 
that your camera provides. On older model SLRs, they are visible on the lens and shut- 
ter speed dials. On newer cameras, you have to look at the electronic viewfinder (or LCD 
screen/indicator) as you manually adjust exposures. 

1. Do these form geometric progressions; if so, what are the ratios? How do these relate 
to exposure values (EVs)? 

2. If your camera has shutter speeds of 1/60 and 1/125, do you think that these two speeds are
exactly a factor of two apart or a factor of 125/60 = 2.083 apart?

3. How accurate do you think these numbers are? Can you devise some way to measure 
exactly how the aperture affects how much light reaches the sensor and what the exact 
exposure times actually are?

Ex 2.6: Noise level calibration Estimate the amount of noise in your camera by taking re-
peated shots of a scene with the camera mounted on a tripod. (Purchasing a remote shutter 
release is a good investment if you own a DSLR.) Alternatively, take a scene with constant 
color regions (such as a color checker chart) and estimate the variance by fitting a smooth 
function to each color region and then taking differences from the predicted function. 

1. Plot your estimated variance as a function of level for each of your color channels
separately. 

2. Change the ISO setting on your camera; if you cannot do that, reduce the overall light 
in your scene (turn off lights, draw the curtains, wait until dusk). Does the amount of 
noise vary a lot with ISO/gain? 

3. Compare your camera to another one at a different price point or year of make. Is 
there evidence to suggest that “you get what you pay for”? Does the quality of digital 
cameras seem to be improving over time? 

Ex 2.7: Gamma correction in image stitching Here’s a relatively simple puzzle. Assume 
you are given two images that are part of a panorama that you want to stitch (see Chapter 9). 
The two images were taken with different exposures, so you want to adjust the RGB values 
so that they match along the seam line. Is it necessary to undo the gamma in the color values 
in order to achieve this? 

Ex 2.8: Skin color detection Devise a simple skin color detector (Forsyth and Fleck 1999; 
Jones and Rehg 2001; Vezhnevets, Sazonov, and Andreeva 2003; Kakumanu, Makrogiannis, 
and Bourbakis 2007) based on chromaticity or other color properties. 

1. Take a variety of photographs of people and calculate the xy chromaticity values for
each pixel.

2. Crop the photos or otherwise indicate with a painting tool which pixels are likely to be
skin (e.g., face and arms).

3. Calculate a color (chromaticity) distribution for these pixels. You can use something as 
simple as a mean and covariance measure or as complicated as a mean-shift segmenta- 
tion algorithm (see Section 5.3.2). You can optionally use non-skin pixels to model the 
background distribution. 

4. Use your computed distribution to find the skin regions in an image. One easy way to 
visualize this is to paint all non-skin pixels a given color, such as white or black. 

5. How sensitive is your algorithm to color balance (scene lighting)?

6. Does a simpler chromaticity measurement, such as a color ratio (2.116), work just as
well?

Ex 2.9: White point balancing — tricky A common (in-camera or post-processing) tech- 
nique for performing white point adjustment is to take a picture of a white piece of paper and 
to adjust the RGB values of an image to make this a neutral color. 

1. Describe how you would adjust the RGB values in an image given a sample “white
color” of $(R_w, G_w, B_w)$ to make this color neutral (without changing the exposure too
much).

2. Does your transformation involve a simple (per-channel) scaling of the RGB values or
do you need a full 3×3 color twist matrix (or something else)?

3. Convert your RGB values to XYZ. Does the appropriate correction now only depend
on the XY (or xy) values? If so, when you convert back to RGB space, do you need a
full 3×3 color twist matrix to achieve the same effect?

4. If you used pure diagonal scaling in the direct RGB mode but end up with a twist if you 
work in XYZ space, how do you explain this apparent dichotomy? Which approach is 
correct? (Or is it possible that neither approach is actually correct?) 

If you want to find out what your camera actually does, continue on to the next exercise. 

Ex 2.10: In-camera color processing — challenging If your camera supports a RAW pixel 
mode, take a pair of RAW and JPEG images, and see if you can infer what the camera is doing 
when it converts the RAW pixel values to the final color-corrected and gamma-compressed 
eight-bit JPEG pixel values. 

1. Deduce the pattern in your color filter array from the correspondence between co- 
located RAW and color-mapped pixel values. Use a color checker chart at this stage 
if it makes your life easier. You may find it helpful to split the RAW image into four 
separate images (subsampling even and odd columns and rows) and to treat each of 
these new images as a “virtual” sensor. 

2. Evaluate the quality of the demosaicing algorithm by taking pictures of challenging
scenes which contain strong color edges (such as those shown in Section 10.3.1).

3. If you can take the same exact picture after changing the color balance values in your 
camera, compare how these settings affect this processing. 

4. Compare your results against those presented by Chakrabarti, Scharstein, and Zickler
(2009) or use the data available in their database of color images.26

26 http://vision.middlebury.edu/color/.

Figure 3.1 Some common image processing operations: (a) original image; (b) increased
contrast; (c) change in hue; (d) “posterized” (quantized colors); (e) blurred; (f) rotated.

Now that we have seen how images are formed through the interaction of 3D scene elements,
lighting, and camera optics and sensors, let us look at the first stage in most computer vision 
applications, namely the use of image processing to preprocess the image and convert it into 
a form suitable for further analysis. Examples of such operations include exposure correction 
and color balancing, the reduction of image noise, increasing sharpness, or straightening the 
image by rotating it (Figure 3.1). While some may consider image processing to be outside 
the purview of computer vision, most computer vision applications, such as computational 
photography and even recognition, require care in designing the image processing stages in 
order to achieve acceptable results. 

In this chapter, we review standard image processing operators that map pixel values from 
one image to another. Image processing is often taught in electrical engineering departments 
as a follow-on course to an introductory course in signal processing (Oppenheim and Schafer 
1996; Oppenheim, Schafer, and Buck 1999). There are several popular textbooks for image 
processing (Crane 1997; Gomes and Velho 1997; Jähne 1997; Pratt 2007; Russ 2007; Burger
and Burge 2008; Gonzales and Woods 2008). 

We begin this chapter with the simplest kind of image transforms, namely those that 
manipulate each pixel independently of its neighbors (Section 3.1). Such transforms are of- 
ten called point operators or point processes. Next, we examine neighborhood (area-based) 
operators, where each new pixel’s value depends on a small number of neighboring input 
values (Sections 3.2 and 3.3). A convenient tool to analyze (and sometimes accelerate) such 
neighborhood operations is the Fourier Transform, which we cover in Section 3.4. Neighbor- 
hood operators can be cascaded to form image pyramids and wavelets, which are useful for 
analyzing images at a variety of resolutions (scales) and for accelerating certain operations 
(Section 3.5). Another important class of global operators are geometric transformations, 
such as rotations, shears, and perspective deformations (Section 3.6). Finally, we introduce 
global optimization approaches to image processing, which involve the minimization of an 
energy functional or, equivalently, optimal estimation using Bayesian Markov random field 
models (Section 3.7). 


3.1 Point operators 

The simplest kinds of image processing transforms are point operators, where each output
pixel’s value depends on only the corresponding input pixel value (plus, potentially, some
globally collected information or parameters). Examples of such operators include brightness
and contrast adjustments (Figure 3.2) as well as color correction and transformations. In the
image processing literature, such operations are also known as point processes (Crane 1997).
We begin this section with a quick review of simple point operators such as brightness
scaling and image addition. Next, we discuss how colors in images can be manipulated.
We then present image compositing and matting operations, which play an important role
in computational photography (Chapter 10) and computer graphics applications. Finally, we
describe the more global process of histogram equalization. We close with an example appli-
cation that manipulates tonal values (exposure and contrast) to improve image appearance.

Figure 3.2 Some local image processing operations: (a) original image along with its three
color (per-channel) histograms; (b) brightness increased (additive offset, b = 16); (c) contrast
increased (multiplicative gain, a = 1.1); (d) gamma (partially) linearized ($\gamma = 1.2$); (e) full
histogram equalization; (f) partial histogram equalization.

Figure 3.3 Visualizing image data: (a) original image; (b) cropped portion and scanline plot
using an image inspection tool; (c) grid of numbers; (d) surface plot. For figures (c)-(d), the
image was first converted to grayscale.

3.1.1 Pixel transforms 

A general image processing operator is a function that takes one or more input images and 
produces an output image. In the continuous domain, this can be denoted as 

$$g(x) = h(f(x)) \quad \text{or} \quad g(x) = h(f_0(x), \ldots, f_n(x)), \tag{3.1}$$

where $x$ is in the D-dimensional domain of the functions (usually D = 2 for images) and the
functions $f$ and $g$ operate over some range, which can either be scalar or vector-valued, e.g.,
for color images or 2D motion. For discrete (sampled) images, the domain consists of a finite
number of pixel locations, $x = (i, j)$, and we can write

$$g(i, j) = h(f(i, j)). \tag{3.2}$$

Figure 3.3 shows how an image can be represented either by its color (appearance), as a grid 
of numbers, or as a two-dimensional function (surface plot). 

Two commonly used point processes are multiplication and addition with a constant, 

$$g(x) = a f(x) + b. \tag{3.3}$$

The parameters $a > 0$ and $b$ are often called the gain and bias parameters; sometimes these
parameters are said to control contrast and brightness, respectively (Figures 3.2b-c).1 The
bias and gain parameters can also be spatially varying,

$$g(x) = a(x) f(x) + b(x), \tag{3.4}$$

e.g., when simulating the graded density filter used by photographers to selectively darken
the sky or when modeling vignetting in an optical system.

1 An image’s luminance characteristics can also be summarized by its key (average luminance) and range
(Kopf, Uyttendaele, Deussen et al. 2007).

Multiplicative gain (both global and spatially varying) is a linear operation, since it obeys
the superposition principle,

$$h(f_0 + f_1) = h(f_0) + h(f_1). \tag{3.5}$$

(We will have more to say about linear shift invariant operators in Section 3.2.) Operators 
such as image squaring (which is often used to get a local estimate of the energy in a band- 
pass filtered signal, see Section 3.5) are not linear. 
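The superposition test is easy to try numerically. The sketch below assumes images stored as floating-point arrays and uses illustrative function names; it checks the superposition property for a global gain, a spatially varying gain, a gain with a non-zero bias, and image squaring.

```python
import numpy as np


def gain(f, a):
    """Multiplicative gain; `a` may be a scalar or a per-pixel array."""
    return a * np.asarray(f, dtype=np.float64)


def gain_bias(f, a, b):
    """Gain and bias point operator g = a f + b, as in (3.3) and (3.4)."""
    return a * np.asarray(f, dtype=np.float64) + b


def square(f):
    """Image squaring, a simple non-linear point operator."""
    return np.asarray(f, dtype=np.float64) ** 2


def obeys_superposition(h, f0, f1):
    """Check h(f0 + f1) == h(f0) + h(f1) on one pair of test images."""
    f0 = np.asarray(f0, dtype=np.float64)
    f1 = np.asarray(f1, dtype=np.float64)
    return bool(np.allclose(h(f0 + f1), h(f0) + h(f1)))


f0 = np.array([[10.0, 20.0], [30.0, 40.0]])
f1 = np.array([[1.0, 2.0], [3.0, 4.0]])
a_map = np.array([[1.0, 0.9], [0.8, 0.7]])   # e.g., a vignetting-like falloff

print(obeys_superposition(lambda f: gain(f, 1.1), f0, f1))
print(obeys_superposition(lambda f: gain(f, a_map), f0, f1))
print(obeys_superposition(lambda f: gain_bias(f, 1.1, 16.0), f0, f1))
print(obeys_superposition(square, f0, f1))
```

A single passing pair of images does not prove linearity, of course, but a single failing pair (as with the non-zero bias or the squaring operator) is enough to rule it out.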

Another commonly used dyadic (two-input) operator is the linear blend operator,

$$g(x) = (1 - \alpha) f_0(x) + \alpha f_1(x). \tag{3.6}$$

By varying $\alpha$ from $0 \to 1$, this operator can be used to perform a temporal cross-dissolve
between two images or videos, as seen in slide shows and film production, or as a component
of image morphing algorithms (Section 3.6.3).
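A cross-dissolve built from the linear blend operator takes only a few lines. This sketch assumes two images of the same shape and evenly spaced blend weights; the function names and the number of steps are illustrative.

```python
import numpy as np


def linear_blend(f0, f1, alpha):
    """Linear blend (1 - alpha) * f0 + alpha * f1, as in (3.6)."""
    f0 = np.asarray(f0, dtype=np.float64)
    f1 = np.asarray(f1, dtype=np.float64)
    return (1.0 - alpha) * f0 + alpha * f1


def cross_dissolve(f0, f1, steps=5):
    """Frames of a dissolve from f0 to f1 with evenly spaced alpha values."""
    return [linear_blend(f0, f1, a) for a in np.linspace(0.0, 1.0, steps)]


start = np.zeros((2, 2))
end = np.full((2, 2), 200.0)
frames = cross_dissolve(start, end, steps=5)
print([float(frame[0, 0]) for frame in frames])
```

The first and last frames reproduce the two inputs exactly, and the middle frame is their average.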

One highly used non-linear transform that is often applied to images before further pro-
cessing is gamma correction, which is used to remove the non-linear mapping between input
radiance and quantized pixel values (Section 2.3.2). To invert the gamma mapping applied
by the sensor, we can use

$$g(x) = [f(x)]^{1/\gamma}, \tag{3.7}$$

where a gamma value of $\gamma \approx 2.2$ is a reasonable fit for most digital cameras.

3.1.2 Color transforms 

While color images can be treated as arbitrary vector-valued functions or collections of inde- 
pendent bands, it usually makes sense to think about them as highly correlated signals with 
strong connections to the image formation process (Section 2.2), sensor design (Section 2.3), 
and human perception (Section 2.3.2). Consider, for example, brightening a picture by adding 
a constant value to all three channels, as shown in Figure 3.2b. Can you tell if this achieves the 
desired effect of making the image look brighter? Can you see any undesirable side-effects 
or artifacts? 

In fact, adding the same value to each color channel not only increases the apparent in- 
tensity of each pixel, it can also affect the pixel’s hue and saturation. How can we define and 
manipulate such quantities in order to achieve the desired perceptual effects? 
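A quick experiment makes the side effects visible. The sketch below assumes RGB components in [0, 1], clips at 1.0, and uses the HSV definition in Python's standard `colorsys` module as a stand-in for perceived hue and saturation; the sample colors and offsets are arbitrary choices.

```python
import colorsys


def add_offset(rgb, offset):
    """Add the same constant to R, G, and B, clipping each channel at 1.0."""
    return tuple(min(1.0, c + offset) for c in rgb)


def hsv_degrees(rgb):
    """Return (hue in degrees, saturation, value) for an RGB triplet."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    return h * 360.0, s, v


muted_red = (0.6, 0.2, 0.2)
orange = (0.9, 0.5, 0.1)

h0, s0, v0 = hsv_degrees(muted_red)
h1, s1, v1 = hsv_degrees(add_offset(muted_red, 0.2))  # no channel clips
h2, s2, v2 = hsv_degrees(orange)
h3, s3, v3 = hsv_degrees(add_offset(orange, 0.3))     # red channel clips

print(f"muted red: saturation {s0:.2f} -> {s1:.2f}, hue {h0:.0f} -> {h1:.0f}")
print(f"orange:    saturation {s2:.2f} -> {s3:.2f}, hue {h2:.0f} -> {h3:.0f}")
```

With this definition, the unclipped example becomes brighter but visibly less saturated (washed out), and once a channel clips, the hue angle shifts as well.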

Figure 3.4 Image matting and compositing (Chuang, Curless, Salesin et al. 2001) © 2001
IEEE: (a) source image; (b) extracted foreground object $F$; (c) alpha matte $\alpha$ shown in
grayscale; (d) new composite $C$.


As discussed in Section 2.3.2, chromaticity coordinates (2.104) or even simpler color ra- 
tios (2.116) can first be computed and then used after manipulating (e.g., brightening) the 
luminance Y to re-compute a valid RGB image with the same hue and saturation. Figure 
2.32g-i shows some color ratio images multiplied by the middle gray value for better visual- 
ization. 
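The difference between adding a constant and rescaling the luminance can be checked numerically. The sketch below is only illustrative: the luminance weights (Rec. 601 style) and the choice of ratios relative to Y are assumptions of this example and are not necessarily the definitions used in (2.104) or (2.116). A black pixel (Y = 0) would need special handling.

```python
def luminance(rgb):
    """Weighted sum of R, G, B (assumed Rec. 601 style weights)."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def color_ratios(rgb):
    """Each channel divided by the luminance; undefined for black pixels."""
    y = luminance(rgb)
    return tuple(c / y for c in rgb)


def brighten_additive(rgb, offset):
    """Add the same constant to all three channels."""
    return tuple(c + offset for c in rgb)


def brighten_luminance(rgb, gain):
    """Scale the luminance, then rebuild RGB from the stored color ratios."""
    ratios = color_ratios(rgb)
    new_y = gain * luminance(rgb)
    return tuple(r * new_y for r in ratios)


pixel = (0.6, 0.3, 0.1)
added = brighten_additive(pixel, 0.2)
scaled = brighten_luminance(pixel, 1.5)
```

Comparing `color_ratios(pixel)` with the ratios of `added` and `scaled` shows that only the luminance-based version keeps the ratios fixed.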

Similarly, color balancing (e.g., to compensate for incandescent lighting) can be per- 
formed either by multiplying each channel with a different scale factor or by the more com- 
plex process of mapping to XYZ color space, changing the nominal white point, and mapping 
back to RGB, which can be written down using a linear 3x3 color twist transform matrix. 
Exercises 2.9 and 3.1 have you explore some of these issues. 

Another fun project, best attempted after you have mastered the rest of the material in 
this chapter, is to take a picture with a rainbow in it and enhance the strength of the rainbow 
(Exercise 3.29). 


3.1.3 Compositing and matting 

In many photo editing and visual effects applications, it is often desirable to cut a foreground 
object out of one scene and put it on top of a different background (Figure 3.4). The process 
of extracting the object from the original image is often called matting (Smith and Blinn 
1996), while the process of inserting it into another image (without visible artifacts) is called 
compositing (Porter and Duff 1984; Blinn 1994a). 

The intermediate representation used for the foreground object between these two stages
is called an alpha-matted color image (Figure 3.4b-c). In addition to the three color RGB
channels, an alpha-matted image contains a fourth alpha channel $\alpha$ (or A) that describes the
relative amount of opacity or fractional coverage at each pixel (Figures 3.4c and 3.5b). The
opacity is the opposite of the transparency. Pixels within the object are fully opaque ($\alpha = 1$),
while pixels fully outside the object are transparent ($\alpha = 0$). Pixels on the boundary of the
object vary smoothly between these two extremes, which hides the perceptually visible jaggies
that occur if only binary opacities are used.

Computer Vision: Algorithms and Applications (September 3, 2010 draft)

Figure 3.5 Compositing equation $C = (1 - \alpha)B + \alpha F$. The images are taken from a
close-up of the region of the hair in the upper right part of the lion in Figure 3.4.

To composite a new (or foreground) image on top of an old (background) image, the over 
operator, first proposed by Porter and Duff (1984) and then studied extensively by Blinn 
(1994a; 1994b), is used, 

$$C = (1 - \alpha)B + \alpha F. \tag{3.8}$$

This operator attenuates the influence of the background image B by a factor $(1 - \alpha)$ and
then adds in the color (and opacity) values corresponding to the foreground layer F, as shown
in Figure 3.5.
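Equation (3.8) can be written per pixel in a few lines. The following is a minimal sketch, assuming colors are tuples of floats in [0, 1] and a single scalar alpha per pixel; the function name `over` simply mirrors the operator's name.

```python
def over(foreground, alpha, background):
    """Composite one pixel: C = (1 - alpha) * B + alpha * F, per channel."""
    return tuple((1.0 - alpha) * b + alpha * f
                 for f, b in zip(foreground, background))


red = (1.0, 0.0, 0.0)
blue = (0.0, 0.0, 1.0)

fully_opaque = over(red, 1.0, blue)       # foreground replaces background
fully_transparent = over(red, 0.0, blue)  # background shows through unchanged
boundary = over(red, 0.25, blue)          # partial coverage at an object edge
```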

In many situations, it is convenient to represent the foreground colors in pre-multiplied
form, i.e., to store (and manipulate) the $\alpha F$ values directly. As Blinn (1994b) shows, the
pre-multiplied RGBA representation is preferred for several reasons, including the ability
to blur or resample (e.g., rotate) alpha-matted images without any additional complications
(just treating each RGBA band independently). However, when matting using local color
consistency (Ruzon and Tomasi 2000; Chuang, Curless, Salesin et al. 2001), the pure
un-multiplied foreground colors F are used, since these remain constant (or vary slowly) in the
vicinity of the object edge.
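One way to see why pre-multiplied colors resample cleanly is to average an opaque pixel with a fully transparent one whose stored color is arbitrary. The sketch below is a toy example with assumed helper names (`premultiply`, `unpremultiply`, `average`); a two-pixel average stands in for a general blur or resampling filter.

```python
def premultiply(rgba):
    """Convert (R, G, B, alpha) to (alpha*R, alpha*G, alpha*B, alpha)."""
    r, g, b, a = rgba
    return (a * r, a * g, a * b, a)


def unpremultiply(rgba):
    """Recover un-multiplied colors; transparent pixels have no defined color."""
    r, g, b, a = rgba
    if a == 0:
        return (0.0, 0.0, 0.0, 0.0)
    return (r / a, g / a, b / a, a)


def average(p, q):
    """Stand-in for a blur or resampling step: average each band independently."""
    return tuple(0.5 * (x + y) for x, y in zip(p, q))


opaque_red = (1.0, 0.0, 0.0, 1.0)
clear_green = (0.0, 1.0, 0.0, 0.0)  # fully transparent, so its color should not matter

straight = average(opaque_red, clear_green)
premultiplied = unpremultiply(
    average(premultiply(opaque_red), premultiply(clear_green)))
```

Averaging the un-multiplied values lets the invisible green leak into the result, whereas the pre-multiplied route returns pure red at half coverage.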

The over operation is not the only kind of compositing operation that can be used. Porter 
and Duff (1984) describe a number of additional operations that can be useful in photo editing 
and visual effects applications. In this book, we concern ourselves with only one additional, 
commonly occurring case (but see Exercise 3.2). 

When light reflects off clean transparent glass, the light passing through the glass and 
the light reflecting off the glass are simply added together (Figure 3.6). This model is use- 
ful in the analysis of transparent motion (Black and Anandan 1996; Szeliski, Avidan, and 
Anandan 2000), which occurs when such scenes are observed from a moving camera (see 
Section 8.5.2). 

The actual process of matting, i.e., recovering the foreground, background, and alpha
matte values from one or more images, has a rich history, which we study in Section 10.4.

Figure 3.6 An example of light reflecting off the transparent glass of a picture frame (Black
and Anandan 1996) © 1996 Elsevier. You can clearly see the woman’s portrait inside the
picture frame superimposed with the reflection of a man’s face off the glass.


Smith and Blinn (1996) have a nice survey of traditional blue-screen matting techniques, 
while Toyama, Krumm, Brumitt et al. (1999) review difference matting. More recently, there 
has been a lot of activity in computational photography relating to natural image matting 
(Ruzon and Tomasi 2000; Chuang, Curless, Salesin et al. 2001; Wang and Cohen 2007a), 
which attempts to extract the mattes from a single natural image (Figure 3.4a) or from ex- 
tended video sequences (Chuang, Agarwala, Curless et al. 2002). All of these techniques are 
described in more detail in Section 10.4. 

3.1.4 Histogram equalization 

While the brightness and gain controls described in Section 3.1.1 can improve the appearance 
of an image, how can we automatically determine their best values? One approach might 
be to look at the darkest and brightest pixel values in an image and map them to pure black 
and pure white. Another approach might be to find the average value in the image, push it 
towards middle gray, and expand the range so that it more closely fills the displayable values 
(Kopf, Uyttendaele, Deussen et al. 2007). 

How can we visualize the set of lightness values in an image in order to test some of
these heuristics? The answer is to plot the histogram of the individual color channels and
luminance values, as shown in Figure 3.7b.² From this distribution, we can compute relevant
statistics such as the minimum, maximum, and average intensity values. Notice that the image
in Figure 3.7a has both an excess of dark values and light values, but that the mid-range values
are largely under-populated. Would it not be better if we could simultaneously brighten some
dark values and darken some light values, while still using the full extent of the available
dynamic range? Can you think of a mapping that might do this?

² The histogram is simply the count of the number of pixels at each gray level value. For an eight-bit image, an
accumulation table with 256 entries is needed. For higher bit depths, a table with the appropriate number of entries
(probably fewer than the full number of gray levels) should be used.

Figure 3.7 Histogram analysis and equalization: (a) original image; (b) color channel and
intensity (luminance) histograms; (c) cumulative distribution functions; (d) equalization
(transfer) functions; (e) full histogram equalization; (f) partial histogram equalization.

One popular answer to this question is to perform histogram equalization, i.e., to find
an intensity mapping function $f(I)$ such that the resulting histogram is flat. The trick to
finding such a mapping is the same one that people use to generate random samples from
a probability density function, which is to first compute the cumulative distribution function
shown in Figure 3.7c.

Think of the original histogram h(I) as the distribution of grades in a class after some
exam. How can we map a particular grade to its corresponding percentile, so that students at
the 75th percentile scored better than 3/4 of their classmates? The answer is to integrate
the distribution h(I) to obtain the cumulative distribution c(I),

$$c(I) = \frac{1}{N} \sum_{i=0}^{I} h(i) = c(I-1) + \frac{1}{N} h(I), \tag{3.9}$$

where N is the number of pixels in the image or students in the class. For any given grade or
intensity, we can look up its corresponding percentile c(I) and determine the final value that
pixel should take. When working with eight-bit pixel values, the I and c axes are rescaled
from [0, 255].

Figure 3.8 Locally adaptive histogram equalization: (a) original image; (b) block histogram
equalization; (c) full locally adaptive equalization.


Figure 3.7e shows the result of applying $f(I) = c(I)$ to the original image. As we
can see, the resulting histogram is flat; so is the resulting image (it is “flat” in the sense
of a lack of contrast and being muddy looking). One way to compensate for this is to only
partially compensate for the histogram unevenness, e.g., by using a mapping function
$f(I) = \alpha c(I) + (1 - \alpha) I$, which is a linear blend between the cumulative distribution function and
the identity transform (a straight line). As you can see in Figure 3.7f, the resulting image
maintains more of its original grayscale distribution while having a more appealing balance.
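The histogram, its cumulative distribution (3.9), and the full or partial equalization mapping can be sketched as follows. This is a minimal illustration operating on a flat list of integer gray levels; the function names, the `levels` parameter, and the rescaling of c(I) by `levels - 1` are assumptions of this example.

```python
def histogram(pixels, levels=256):
    """Count the number of pixels at each gray level."""
    h = [0] * levels
    for p in pixels:
        h[p] += 1
    return h


def cumulative_distribution(h):
    """c(I) = c(I - 1) + h(I) / N, Equation (3.9), with values in [0, 1]."""
    n = sum(h)
    c, running = [], 0
    for count in h:
        running += count
        c.append(running / n)
    return c


def equalization_lut(pixels, alpha=1.0, levels=256):
    """Lookup table for f(I) = alpha * c(I) + (1 - alpha) * I, with c rescaled."""
    c = cumulative_distribution(histogram(pixels, levels))
    top = levels - 1
    return [round(alpha * top * c[i] + (1.0 - alpha) * i) for i in range(levels)]


def equalize(pixels, alpha=1.0, levels=256):
    """Map every pixel through the (partial) equalization lookup table."""
    lut = equalization_lut(pixels, alpha, levels)
    return [lut[p] for p in pixels]


pixels = [0, 0, 0, 0, 1, 1, 2, 3]           # a tiny four-level "image"
full = equalize(pixels, alpha=1.0, levels=4)
partial = equalize(pixels, alpha=0.5, levels=4)
```

Setting `alpha=0` returns the identity mapping and `alpha=1` gives full equalization, so intermediate values trade off between the two.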

Another potential problem with histogram equalization (or, in general, image brightening) 
is that noise in dark regions can be amplified and become more visible. Exercise 3.6 suggests 
some possible ways to mitigate this, as well as alternative techniques to maintain contrast and 
“punch” in the original images (Larson, Rushmeier, and Piatko 1997; Stark 2000). 

Locally adaptive histogram equalization 

While global histogram equalization can be useful, for some images it might be preferable 
to apply different kinds of equalization in different regions. Consider for example the image 
in Figure 3.8a, which has a wide range of luminance values. Instead of computing a single 
curve, what if we were to subdivide the image into M x M pixel blocks and perform separate 
histogram equalization in each sub-block? As you can see in Figure 3.8b, the resulting image 
exhibits a lot of blocking artifacts, i.e., intensity discontinuities at block boundaries. 

One way to eliminate blocking artifacts is to use a moving window, i.e., to recompute the
histogram for every M × M block centered at each pixel. This process can be quite slow
($M^2$ operations per pixel), although with clever programming only the histogram entries
corresponding to the pixels entering and leaving the block (in a raster scan across the image)
need to be updated (M operations per pixel). Note that this operation is an example of the
non-linear neighborhood operations we study in more detail in Section 3.3.1.

Figure 3.9 Local histogram interpolation using relative (s, t) coordinates: (a) block-based
histograms, with block centers shown as circles; (b) corner-based “spline” histograms. Pixels
are located on grid intersections. The black square pixel’s transfer function is interpolated
from the four adjacent lookup tables (gray arrows) using the computed (s, t) values. Block
boundaries are shown as dashed lines.

A more efficient approach is to compute non-overlapped block-based equalization functions
as before, but to then smoothly interpolate the transfer functions as we move between
blocks. This technique is known as adaptive histogram equalization (AHE) and its
contrast-limited (gain-limited) version is known as CLAHE (Pizer, Amburn, Austin et al. 1987).³ The
weighting function for a given pixel (i, j) can be computed as a function of its horizontal
and vertical position (s, t) within a block, as shown in Figure 3.9a. To blend the four lookup
functions $\{f_{00}, \ldots, f_{11}\}$, a bilinear blending function,

$$f_{s,t}(I) = (1 - s)(1 - t) f_{00}(I) + s(1 - t) f_{10}(I) + (1 - s)t f_{01}(I) + st f_{11}(I) \tag{3.10}$$

can be used. (See Section 3.5.2 for higher-order generalizations of such spline functions.)
Note that instead of blending the four lookup tables for each output pixel (which would be
quite slow), we can instead blend the results of mapping a given pixel through the four
neighboring lookups.
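The last remark, blending the four mapped values rather than the four tables, corresponds to a direct evaluation of Equation (3.10). The sketch below assumes each lookup function is stored as a Python list indexed by intensity; the tiny four-entry tables are made up purely for illustration.

```python
def blend_lookups(intensity, s, t, f00, f10, f01, f11):
    """Evaluate Equation (3.10) for one pixel by blending four mapped values."""
    return ((1 - s) * (1 - t) * f00[intensity]
            + s * (1 - t) * f10[intensity]
            + (1 - s) * t * f01[intensity]
            + s * t * f11[intensity])


# Hypothetical transfer functions of four neighboring blocks (four gray levels).
f00 = [0, 1, 2, 3]
f10 = [0, 2, 3, 3]
f01 = [1, 1, 2, 3]
f11 = [3, 3, 3, 3]

at_block_center = blend_lookups(1, 0.0, 0.0, f00, f10, f01, f11)
midway = blend_lookups(1, 0.5, 0.5, f00, f10, f01, f11)
```

At (s, t) = (0, 0) the result is just the value from `f00`, and halfway between the four block centers it is the average of the four mapped values.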

A variant on this algorithm is to place the lookup tables at the corners of each M × M
block (see Figure 3.9b and Exercise 3.7). In addition to blending four lookups to compute the
final value, we can also distribute each input pixel into four adjacent lookup tables during the
histogram accumulation phase (notice that the gray arrows in Figure 3.9b point both ways),
i.e.,

$$h_{k,l}(I(i,j)) \mathrel{+}= w(i,j,k,l), \tag{3.11}$$

where $w(i,j,k,l)$ is the bilinear weighting function between pixel (i, j) and lookup table
(k, l). This is an example of soft histogramming, which is used in a variety of other
applications, including the construction of SIFT feature descriptors (Section 4.1.3) and vocabulary
trees (Section 14.3.2).

³ This algorithm is implemented in the MATLAB adapthisteq function.

3.1.5 Application: Tonal adjustment

One of the most widely used applications of point-wise image processing operators is the
manipulation of contrast or tone in photographs, to make them look either more attractive or
more interpretable. You can get a good sense of the range of operations possible by opening
up any photo manipulation tool and trying out a variety of contrast, brightness, and color
manipulation options, as shown in Figures 3.2 and 3.7.

Exercises 3.1, 3.5, and 3.6 have you implement some of these operations, in order to 
become familiar with basic image processing operators. More sophisticated techniques for 
tonal adjustment (Reinhard, Ward, Pattanaik et al. 2005; Bae, Paris, and Durand 2006) are 
described in the section on high dynamic range tone mapping (Section 10.2.1). 


3.2 Linear filtering

Locally adaptive histogram equalization is an example of a neighborhood operator or local
operator, which uses a collection of pixel values in the vicinity of a given pixel to
determine its final output value (Figure 3.10). In addition to performing local tone adjustment,
neighborhood operators can be used to filter images in order to add soft blur, sharpen
details, accentuate edges, or remove noise (Figure 3.11b-d). In this section, we look at linear
filtering operators, which involve weighted combinations of pixels in small neighborhoods.
In Section 3.3, we look at non-linear operators such as morphological filters and distance
transforms.

Figure 3.10 Neighborhood filtering (convolution): The image on the left is convolved with
the filter in the middle to yield the image on the right. The light blue pixels indicate the source
neighborhood for the light green destination pixel.

The most commonly used type of neighborhood operator is a linear filter, in which an
output pixel’s value is determined as a weighted sum of input pixel values (Figure 3.10),

$$g(i,j) = \sum_{k,l} f(i + k, j + l)\, h(k,l). \tag{3.12}$$

The entries in the weight kernel or mask h(k, l) are often called the filter coefficients. The
above correlation operator can be more compactly notated as

$$g = f \otimes h. \tag{3.13}$$

A common variant on this formula is

$$g(i,j) = \sum_{k,l} f(i - k, j - l)\, h(k,l) = \sum_{k,l} f(k,l)\, h(i - k, j - l), \tag{3.14}$$

where the sign of the offsets in f has been reversed. This is called the convolution operator,

$$g = f * h, \tag{3.15}$$

and h is then called the impulse response function.⁴ The reason for this name is that the kernel
function, h, convolved with an impulse signal, $\delta(i,j)$ (an image that is 0 everywhere except
at the origin) reproduces itself, $h * \delta = h$, whereas correlation produces the reflected signal.
(Try this yourself to verify that it is so.)
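The suggested experiment is easy to run in one dimension. The sketch below implements the one-dimensional analogs of (3.12) and (3.14) with zero padding; the function names and the odd-length, center-indexed kernel are assumptions made for this example.

```python
def correlate(f, h):
    """g(i) = sum_k f(i + k) h(k), kernel centered, zeros outside the signal."""
    r = len(h) // 2
    g = []
    for i in range(len(f)):
        total = 0
        for k in range(-r, r + 1):
            if 0 <= i + k < len(f):
                total += f[i + k] * h[k + r]
        g.append(total)
    return g


def convolve(f, h):
    """g(i) = sum_k f(i - k) h(k), i.e., correlation with reversed offsets."""
    r = len(h) // 2
    g = []
    for i in range(len(f)):
        total = 0
        for k in range(-r, r + 1):
            if 0 <= i - k < len(f):
                total += f[i - k] * h[k + r]
        g.append(total)
    return g


impulse = [0, 0, 1, 0, 0]
kernel = [1, 2, 3]                 # deliberately asymmetric

convolved = convolve(impulse, kernel)
correlated = correlate(impulse, kernel)
```

Convolution with the impulse reproduces the kernel around the impulse location, while correlation yields its mirror image.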

In fact, Equation (3.14) can be interpreted as the superposition (addition) of shifted
impulse response functions $h(i - k, j - l)$ multiplied by the input pixel values $f(k, l)$.
Convolution has additional nice properties, e.g., it is both commutative and associative. As well, the
Fourier transform of two convolved images is the product of their individual Fourier
transforms (Section 3.4).

Both correlation and convolution are linear shift-invariant (LSI) operators, which obey
both the superposition principle (3.5),

$$h \circ (f_0 + f_1) = h \circ f_0 + h \circ f_1, \tag{3.16}$$

and the shift invariance principle,

$$g(i,j) = f(i + k, j + l) \;\Leftrightarrow\; (h \circ g)(i,j) = (h \circ f)(i + k, j + l), \tag{3.17}$$

which means that shifting a signal commutes with applying the operator ($\circ$ stands for the LSI
operator). Another way to think of shift invariance is that the operator “behaves the same
everywhere”.


⁴ The continuous version of convolution can be written as $g(x) = \int f(x - u) h(u)\,du$.

Figure 3.11 Some neighborhood operations: (a) original image; (b) blurred; (c) sharpened;
(d) smoothed with edge-preserving filter; (e) binary image; (f) dilated; (g) distance transform;
(h) connected components. For the dilation and connected components, black (ink) pixels are
assumed to be active, i.e., to have a value of 1 in Equations (3.41-3.45).

Figure 3.12 One-dimensional signal convolution as a sparse matrix-vector multiply,
$\mathbf{g} = \mathbf{H}\mathbf{f}$.


Occasionally, a shift-variant version of correlation or convolution may be used, e.g.,

$$g(i,j) = \sum_{k,l} f(i - k, j - l)\, h(k,l;i,j), \tag{3.18}$$

where $h(k,l;i,j)$ is the convolution kernel at pixel (i, j). For example, such a spatially
varying kernel can be used to model blur in an image due to variable depth-dependent defocus.

Correlation and convolution can both be written as a matrix-vector multiply, if we first
convert the two-dimensional images f(i, j) and g(i, j) into raster-ordered vectors $\mathbf{f}$ and $\mathbf{g}$,

$$\mathbf{g} = \mathbf{H}\mathbf{f}, \tag{3.19}$$

where the (sparse) $\mathbf{H}$ matrix contains the convolution kernels. Figure 3.12 shows how a
one-dimensional convolution can be represented in matrix-vector form.
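A small one-dimensional example makes the structure of H concrete. The sketch below builds a dense banded matrix for clarity (a practical implementation would store it sparsely); the helper names and the sample values are assumptions of this example and are not taken from Figure 3.12.

```python
def convolution_matrix(h, n):
    """n x n banded matrix H with (H f)(i) = sum_k f(i - k) h(k), zeros outside."""
    r = len(h) // 2
    H = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(-r, r + 1):
            j = i - k                   # column of the input sample f(i - k)
            if 0 <= j < n:
                H[i][j] = h[k + r]
    return H


def matvec(H, f):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, f)) for row in H]


def convolve_direct(f, h):
    """Reference one-dimensional convolution with zero padding."""
    r, n = len(h) // 2, len(f)
    return [sum(f[i - k] * h[k + r]
                for k in range(-r, r + 1) if 0 <= i - k < n)
            for i in range(n)]


kernel = [1, 2, 3]
signal = [72, 88, 62, 52, 37]
H = convolution_matrix(kernel, len(signal))
g = matvec(H, signal)
```

The first and last rows of `H` contain fewer kernel entries than the interior rows, which is the source of the boundary effects discussed next.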

Padding (border effects) 

The astute reader will notice that the matrix multiply shown in Figure 3.12 suffers from
boundary effects, i.e., the results of filtering the image in this form will lead to a darkening of
the corner pixels. This is because the original image is effectively being padded with 0 values
wherever the convolution kernel extends beyond the original image boundaries.

To compensate for this, a number of alternative padding or extension modes have been 
developed (Figure 3.13): 

• zero: set all pixels outside the source image to 0 (a good choice for alpha-matted cutout
images);

• constant (border color): set all pixels outside the source image to a specified border
value;

• clamp (replicate or clamp to edge): repeat edge pixels indefinitely;

• (cyclic) wrap (repeat or tile): loop “around” the image in a “toroidal” configuration;

• mirror: reflect pixels across the image edge;

• extend: extend the signal by subtracting the mirrored version of the signal from the
edge pixel value.

Figure 3.13 Border padding (top row) and the results of blurring the padded image (bottom
row). The normalized zero image is the result of dividing (normalizing) the blurred
zero-padded RGBA image by its corresponding soft alpha value.

In the computer graphics literature (Akenine-Möller and Haines 2002, p. 124), these
mechanisms are known as the wrapping mode (OpenGL) or texture addressing mode (Direct3D).
The formulas for each of these modes are left to the reader (Exercise 3.8).

Figure 3.13 shows the effects of padding an image with each of the above mechanisms and
then blurring the resulting padded image. As you can see, zero padding darkens the edges,
clamp (replication) padding propagates border values inward, and mirror (reflection) padding
preserves colors near the borders. Extension padding (not shown) keeps the border pixels fixed
(during blur).
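One possible set of one-dimensional formulas for the first five modes is sketched below. This is only one reading of the descriptions above: in particular, the mirror mode here repeats the edge sample when reflecting, which is an assumption (reflecting about the edge sample itself is an equally common convention), and the extend mode is omitted.

```python
def padded_sample(signal, i, mode, border=0):
    """Return the value of a 1D signal at index i, which may lie outside it."""
    n = len(signal)
    if 0 <= i < n:
        return signal[i]
    if mode == "zero":
        return 0
    if mode == "constant":
        return border
    if mode == "clamp":
        return signal[min(max(i, 0), n - 1)]    # repeat the nearest edge pixel
    if mode == "wrap":
        return signal[i % n]                    # cyclic (toroidal) indexing
    if mode == "mirror":
        j = i % (2 * n)                         # period of the reflected signal
        return signal[j if j < n else 2 * n - 1 - j]
    raise ValueError("unknown padding mode: " + mode)


signal = [10, 20, 30, 40]
indices = range(-2, 6)
padded = {mode: [padded_sample(signal, i, mode) for i in indices]
          for mode in ("zero", "clamp", "wrap", "mirror")}
```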

An alternative to padding is to blur the zero-padded RGBA image and to then divide the 
resulting image by its alpha value to remove the darkening effect. The results can be quite 
good, as seen in the normalized zero image in Figure 3.13. 
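The normalization idea can be tried on a one-dimensional signal. In the sketch below, a box blur with zero padding stands in for the blur, and the alpha channel is assumed to be 1 everywhere inside the signal (so the pre-multiplied values equal the original values); the function names are made up for this example.

```python
def box_blur_zero(values, radius=1):
    """Box blur that treats samples outside the signal as zero."""
    n, width = len(values), 2 * radius + 1
    out = []
    for i in range(n):
        total = sum(values[j] for j in range(i - radius, i + radius + 1)
                    if 0 <= j < n)
        out.append(total / width)
    return out


def normalized_blur(values, radius=1):
    """Blur values and alpha with zero padding, then divide by blurred alpha."""
    alpha = [1.0] * len(values)                 # fully opaque inside the signal
    blurred = box_blur_zero(values, radius)
    blurred_alpha = box_blur_zero(alpha, radius)
    return [v / a for v, a in zip(blurred, blurred_alpha)]


flat = [6.0, 6.0, 6.0, 6.0]
darkened = box_blur_zero(flat)      # edges drop because zeros are averaged in
restored = normalized_blur(flat)    # division by blurred alpha undoes the drop
```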

3.2.1 Separable filtering 

The process of performing a convolution requires $K^2$ (multiply-add) operations per pixel,
where K is the size (width or height) of the convolution kernel, e.g., the box filter in
Figure 3.14a. In many cases, this operation can be significantly sped up by first performing a
one-dimensional horizontal convolution followed by a one-dimensional vertical convolution
(which requires a total of 2K operations per pixel). A convolution kernel for which this is
possible is said to be separable.

Figure 3.14 Separable linear filters: (a) box, K = 5; (b) bilinear; (c) “Gaussian”; (d) Sobel;
(e) corner. For each image (a)-(e), we show the 2D filter kernel (top), the corresponding
horizontal 1D kernel (middle), and the filtered image (bottom). The filtered Sobel and corner
images are signed, scaled up by 2× and 4×, respectively, and added to a gray offset before display.

It is easy to show that the two-dimensional kernel K corresponding to successive con- 
volution with a horizontal kernel h and a vertical kernel v is the outer product of the two 
kernels, 

$$K = v h^T \tag{3.20}$$

(see Figure 3.14 for some examples). Because of the increased efficiency, the design of 
convolution kernels for computer vision applications is often influenced by their separability. 
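The equivalence between two one-dimensional passes and a single pass with the outer-product kernel (3.20) can be verified on a small integer image. The sketch below uses a generic zero-padded 2D correlation routine (an assumption of this example; for the symmetric kernels used here correlation and convolution coincide) and expresses the 1D passes as 1 × K and K × 1 kernels.

```python
def correlate2d(image, kernel):
    """Zero-padded 2D correlation with a centered kernel of odd size."""
    rows, cols = len(image), len(image[0])
    krows, kcols = len(kernel), len(kernel[0])
    rr, rc = krows // 2, kcols // 2
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = 0
            for k in range(krows):
                for l in range(kcols):
                    y, x = i + k - rr, j + l - rc
                    if 0 <= y < rows and 0 <= x < cols:
                        total += image[y][x] * kernel[k][l]
            out[i][j] = total
    return out


h = [1, 2, 1]                                   # horizontal 1D kernel
v = [1, 2, 1]                                   # vertical 1D kernel
K = [[a * b for b in h] for a in v]             # outer product v h^T, Equation (3.20)

image = [[(3 * i + 5 * j) % 7 for j in range(6)] for i in range(5)]

horizontal = correlate2d(image, [h])            # 1 x 3 kernel: row pass
two_pass = correlate2d(horizontal, [[a] for a in v])   # 3 x 1 kernel: column pass
direct = correlate2d(image, K)                  # single 3 x 3 pass
```

The two results agree exactly, while the two-pass version touches 2K rather than K² kernel entries per pixel.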

How can we tell if a given kernel K is indeed separable? This can often be done by
inspection or by looking at the analytic form of the kernel (Freeman and Adelson 1991). A
more direct method is to treat the 2D kernel as a 2D matrix K and to take its singular value
decomposition (SVD),

$$K = \sum_i \sigma_i u_i v_i^T \tag{3.21}$$

(see Appendix A.1.1 for the definition of the SVD). If only the first singular value $\sigma_0$ is
non-zero, the kernel is separable and $\sqrt{\sigma_0} u_0$ and $\sqrt{\sigma_0} v_0^T$ provide the vertical and horizontal
kernels (Perona 1995). For example, the Laplacian of Gaussian kernel (3.26 and 4.23) can be
implemented as the sum of two separable filters (4.24) (Wiejak, Buxton, and Buxton 1985).

What if your kernel is not separable and yet you still want a faster way to implement
it? Perona (1995), who first made the link between kernel separability and SVD, suggests
using more terms in the (3.21) series, i.e., summing up a number of separable convolutions.
Whether this is worth doing or not depends on the relative sizes of K and the number of
significant singular values, as well as other considerations, such as cache coherency and memory
locality.
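A kernel has a single non-zero singular value exactly when its matrix has rank one. The sketch below tests this without calling an SVD routine: it factors the kernel through its largest entry and checks the reconstruction. This pivot-based test is a substitute chosen to keep the example dependency-free; it is not the SVD procedure described above, and the factors it returns are scaled differently from $\sqrt{\sigma_0} u_0$ and $\sqrt{\sigma_0} v_0^T$.

```python
def separate(kernel, tol=1e-9):
    """Return (v, h) with kernel[i][j] == v[i] * h[j], or None if not rank one."""
    rows, cols = len(kernel), len(kernel[0])
    # Use the largest-magnitude entry as the pivot for numerical safety.
    r, c = max(((i, j) for i in range(rows) for j in range(cols)),
               key=lambda ij: abs(kernel[ij[0]][ij[1]]))
    pivot = kernel[r][c]
    if abs(pivot) <= tol:
        return None                              # all-zero kernel
    v = [kernel[i][c] for i in range(rows)]      # pivot column
    h = [kernel[r][j] / pivot for j in range(cols)]   # scaled pivot row
    for i in range(rows):
        for j in range(cols):
            if abs(kernel[i][j] - v[i] * h[j]) > tol:
                return None                      # residual left: rank > 1
    return v, h


sobel = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
five_point_laplacian = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]

sobel_factors = separate(sobel)
laplacian_factors = separate(five_point_laplacian)
```

The Sobel kernel factors into a vertical and a horizontal 1D kernel, whereas the five-point Laplacian does not.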


3.2.2 Examples of linear filtering 

Now that we have described the process for performing linear filtering, let us examine a 
number of frequently used filters. 

The simplest filter to implement is the moving average or box filter, which simply averages
the pixel values in a K × K window. This is equivalent to convolving the image with a kernel
of all ones and then scaling (Figure 3.14a). For large kernels, a more efficient implementation
is to slide a moving window across each scanline (in a separable filter) while adding the
newest pixel and subtracting the oldest pixel from the running sum. This is related to the
concept of summed area tables, which we describe shortly.
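The running-sum update for one scanline might look like the following. This is a minimal sketch assuming an odd window width and zero padding at the ends of the scanline; the function name is made up for this example.

```python
def box_filter_running(values, k):
    """Moving average of odd width k using a running sum (zero padding)."""
    r = k // 2
    n = len(values)
    running = sum(values[:r + 1])    # window centered on index 0, zeros to the left
    out = []
    for i in range(n):
        out.append(running / k)
        entering = i + r + 1         # newest sample for the window centered at i + 1
        leaving = i - r              # oldest sample of the window centered at i
        if entering < n:
            running += values[entering]
        if leaving >= 0:
            running -= values[leaving]
    return out


scanline = [4, 8, 6, 2, 10, 0, 12]
averaged = box_filter_running(scanline, 3)
```

Each output sample costs one addition and one subtraction regardless of the window width.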

A smoother image can be obtained by separably convolving the image with a piecewise 
linear “tent” function (also known as a Bartlett filter). Figure 3.14b shows a 3 x 3 version 
of this filter, which is called the bilinear kernel, since it is the outer product of two linear 
(first-order) splines (see Section 3.5.2). 

Convolving the linear tent function with itself yields the cubic approximating spline, 
which is called the “Gaussian” kernel (Figure 3.14c) in Burt and Adelson’s (1983a) Lapla- 
cian pyramid representation (Section 3.5). Note that approximate Gaussian kernels can also 
be obtained by iterated convolution with box filters (Wells 1986). In applications where the 
filters really need to be rotationally symmetric, carefully tuned versions of sampled Gaussians 
should be used (Freeman and Adelson 1991) (Exercise 3.10). 

The kernels we just discussed are all examples of blurring (smoothing) or low-pass
kernels (since they pass through the lower frequencies while attenuating higher frequencies).
How good are they at doing this? In Section 3.4, we use frequency-space Fourier analysis to
examine the exact frequency response of these filters. We also introduce the sinc ((sin x)/x)
filter, which performs ideal low-pass filtering.

In practice, smoothing kernels are often used to reduce high-frequency noise. We have 
much more to say about using variants on smoothing to remove noise later (see Sections 3.3.1, 
3.4, and 3.7). 

Surprisingly, smoothing kernels can also be used to sharpen images using a process called
unsharp masking. Since blurring the image reduces high frequencies, adding some of the
difference between the original and the blurred image makes it sharper,

$$g_{\text{sharp}} = f + \gamma (f - h_{\text{blur}} * f). \tag{3.22}$$

In fact, before the advent of digital photography, this was the standard way to sharpen images
in the darkroom: create a blurred (“positive”) negative from the original negative by
misfocusing, then overlay the two negatives before printing the final image, which corresponds
to

$$g_{\text{unsharp}} = f (1 - \gamma h_{\text{blur}} * f). \tag{3.23}$$

This is no longer a linear filter but it still works well.
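Equation (3.22) can be tried on a one-dimensional step edge. In the sketch below, the [1 2 1]/4 tent kernel with clamp padding is an assumed choice for $h_{\text{blur}}$, picked only because it keeps the arithmetic exact.

```python
def blur_121(f):
    """Blur with the [1, 2, 1] / 4 kernel, repeating the edge samples."""
    n = len(f)
    return [(f[max(i - 1, 0)] + 2 * f[i] + f[min(i + 1, n - 1)]) / 4
            for i in range(n)]


def unsharp_mask(f, gamma=1.0):
    """g_sharp = f + gamma * (f - h_blur * f), Equation (3.22)."""
    blurred = blur_121(f)
    return [x + gamma * (x - b) for x, b in zip(f, blurred)]


step = [0, 0, 0, 1, 1, 1]
sharpened = unsharp_mask(step, gamma=1.0)
```

The result undershoots just before the step and overshoots just after it, which is what makes the edge look crisper.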

Linear filtering can also be used as a pre-processing stage to edge extraction (Section 4.2)
and interest point detection (Section 4.1) algorithms. Figure 3.14d shows a simple 3 × 3 edge
extractor called the Sobel operator, which is a separable combination of a horizontal central
difference (so called because the horizontal derivative is centered on the pixel) and a vertical
tent filter (to smooth the results). As you can see in the image below the kernel, this filter
effectively emphasizes vertical edges, i.e., intensity changes along the horizontal direction.

The simple corner detector (Figure 3.14e) looks for simultaneous horizontal and vertical
second derivatives. As you can see, however, it responds not only to the corners of the square,
but also along diagonal edges. Better corner detectors, or at least interest point detectors that
are more rotationally invariant, are described in Section 4.1.


3.2.3 Band-pass and steerable filters 


The Sobel and corner operators are simple examples of band-pass and oriented filters. More 
sophisticated kernels can be created by first smoothing the image with a (unit area) Gaussian 
filter, 

$$G(x, y; \sigma) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}, \tag{3.24}$$

and then taking the first or second derivatives (Marr 1982; Witkin 1983; Freeman and Adelson 
1991). Such filters are known collectively as band-pass filters, since they filter out both low 
and high frequencies. 

The (undirected) second derivative of a two-dimensional image,

$$\nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}, \tag{3.25}$$

is known as the Laplacian operator. Blurring an image with a Gaussian and then taking its
Laplacian is equivalent to convolving directly with the Laplacian of Gaussian (LoG) filter,

$$\nabla^2 G(x, y; \sigma) = \left( \frac{x^2 + y^2}{\sigma^4} - \frac{2}{\sigma^2} \right) G(x, y; \sigma), \tag{3.26}$$

which has certain nice scale-space properties (Witkin 1983; Witkin, Terzopoulos, and Kass
1986). The five-point Laplacian is just a compact approximation to this more sophisticated
filter.

Figure 3.15 Second-order steerable filter (Freeman 1992) © 1992 IEEE: (a) original image
of Einstein; (b) orientation map computed from the second-order oriented energy; (c) original
image with oriented structures enhanced.

Likewise, the Sobel operator is a simple approximation to a directional or oriented filter,
which can be obtained by smoothing with a Gaussian (or some other filter) and then taking a
directional derivative $\nabla_{\hat{u}} = \frac{\partial}{\partial \hat{u}}$, which is obtained by taking the dot product between the
gradient field $\nabla$ and a unit direction $\hat{u} = (\cos\theta, \sin\theta)$,

$$\hat{u} \cdot \nabla(G * f) = \nabla_{\hat{u}}(G * f) = (\nabla_{\hat{u}} G) * f. \tag{3.27}$$

The smoothed directional derivative filter,

$$G_{\hat{u}} = u G_x + v G_y = u \frac{\partial G}{\partial x} + v \frac{\partial G}{\partial y}, \tag{3.28}$$

where $\hat{u} = (u, v)$, is an example of a steerable filter, since the value of an image convolved
with $G_{\hat{u}}$ can be computed by first convolving with the pair of filters $(G_x, G_y)$ and then
steering the filter (potentially locally) by multiplying this gradient field with a unit vector $\hat{u}$
(Freeman and Adelson 1991). The advantage of this approach is that a whole family of filters
can be evaluated with very little cost.
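The steering property of (3.28) follows from linearity and can be checked numerically. The sketch below samples $G_x$ and $G_y$ on a small grid (the radius, $\sigma$, test image, and helper names are all assumptions of this example) and compares filtering with the steered kernel against steering the two basis responses. It uses correlation rather than flipped convolution, which does not affect the comparison.

```python
import math


def gaussian_derivative_kernels(sigma=1.0, radius=3):
    """Sampled G_x = -x / sigma^2 * G and G_y = -y / sigma^2 * G."""
    gx, gy = [], []
    for y in range(-radius, radius + 1):
        row_x, row_y = [], []
        for x in range(-radius, radius + 1):
            g = math.exp(-(x * x + y * y) / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)
            row_x.append(-x / sigma ** 2 * g)
            row_y.append(-y / sigma ** 2 * g)
        gx.append(row_x)
        gy.append(row_y)
    return gx, gy


def correlate2d(image, kernel):
    """Zero-padded 2D correlation with a centered kernel of odd size."""
    rows, cols = len(image), len(image[0])
    kr, kc = len(kernel) // 2, len(kernel[0]) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k, krow in enumerate(kernel):
                for l, w in enumerate(krow):
                    y, x = i + k - kr, j + l - kc
                    if 0 <= y < rows and 0 <= x < cols:
                        out[i][j] += image[y][x] * w
    return out


theta = 0.7
u, v = math.cos(theta), math.sin(theta)
gx, gy = gaussian_derivative_kernels()
image = [[(3 * i + 5 * j) % 7 for j in range(8)] for i in range(8)]

# Option 1: build the steered kernel u * G_x + v * G_y, then filter once.
steered_kernel = [[u * a + v * b for a, b in zip(rx, ry)] for rx, ry in zip(gx, gy)]
direct = correlate2d(image, steered_kernel)

# Option 2: filter with the two basis kernels, then steer the responses.
rx, ry = correlate2d(image, gx), correlate2d(image, gy)
steered = [[u * a + v * b for a, b in zip(row_x, row_y)] for row_x, row_y in zip(rx, ry)]

max_difference = max(abs(a - b) for ra, rb in zip(direct, steered) for a, b in zip(ra, rb))
```

Once `rx` and `ry` are available, the response for any other angle costs only two multiplications and one addition per pixel.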

How about steering a directional second derivative filter $G_{\hat{u}\hat{u}} = \nabla_{\hat{u}} (\nabla_{\hat{u}} G)$, which is the result
of taking a (smoothed) directional derivative and then taking the directional derivative again?
For example, $G_{xx}$ is the second directional derivative in the x direction.

At first glance, it would appear that the steering trick will not work, since for every
direction $\hat{u}$, we need to compute a different first directional derivative. Somewhat surprisingly,
Freeman and Adelson (1991) showed that, for directional Gaussian derivatives, it is possible
to steer any order of derivative with a relatively small number of basis functions. For example,
only three basis functions are required for the second-order directional derivative,

$$G_{\hat{u}\hat{u}} = u^2 G_{xx} + 2uv G_{xy} + v^2 G_{yy}. \tag{3.29}$$

Furthermore, each of the basis filters, while not itself necessarily separable, can be computed
using a linear combination of a small number of separable filters (Freeman and Adelson
1991).

Figure 3.16 Fourth-order steerable filter (Freeman and Adelson 1991) © 1991 IEEE: (a)
test image containing bars (lines) and step edges at different orientations; (b) average oriented
energy; (c) dominant orientation; (d) oriented energy as a function of angle (polar plot).

This remarkable result makes it possible to construct directional derivative filters of in- 
creasingly greater directional selectivity, i.e., filters that only respond to edges that have 
strong local consistency in orientation (Figure 3.15). Furthermore, higher order steerable 
filters can respond to potentially more than a single edge orientation at a given location, and 
they can respond to both bar edges (thin lines) and the classic step edges (Figure 3.16). In 
order to do this, however, full Hilbert transform pairs need to be used for second-order and 
higher filters, as described in (Freeman and Adelson 1991). 

Steerable filters are often used to construct both feature descriptors (Section 4.1.3) and
edge detectors (Section 4.2). While the filters developed by Freeman and Adelson (1991)
are best suited for detecting linear (edge-like) structures, more recent work by Koethe (2003)
shows how a combined 2 × 2 boundary tensor can be used to encode both edge and junction
(“corner”) features. Exercise 3.12 has you implement such steerable filters and apply them to
finding both edge and corner features.


Summed area table (integral image) 

If an image is going to be repeatedly convolved with different box filters (and especially filters
of different sizes at different locations), you can precompute the summed area table (Crow
1984), which is just the running sum of all the pixel values from the origin,

$$s(i,j) = \sum_{k=0}^{i} \sum_{l=0}^{j} f(k,l). \tag{3.30}$$

This can be efficiently computed using a recursive (raster-scan) algorithm,

$$s(i,j) = s(i - 1, j) + s(i, j - 1) - s(i - 1, j - 1) + f(i,j). \tag{3.31}$$

The image s(i,j) is also often called an integral image (see Figure 3.17) and can actually be
computed using only two additions per pixel if separate row sums are used (Viola and Jones
2004). To find the summed area (integral) inside a rectangle $[i_0, i_1] \times [j_0, j_1]$, we simply
combine four samples from the summed area table,

$$S(i_0 \ldots i_1, j_0 \ldots j_1) = \sum_{i=i_0}^{i_1} \sum_{j=j_0}^{j_1} f(i,j) = s(i_1, j_1) - s(i_1, j_0 - 1) - s(i_0 - 1, j_1) + s(i_0 - 1, j_0 - 1). \tag{3.32}$$

Figure 3.17 Summed area tables: (a) original image (S = 24); (b) summed area table (s = 28);
(c) computation of area sum (S = 24). Each value in the summed area table s(i,j) (red) is
computed recursively from its three adjacent (blue) neighbors (3.31). Area sums S (green) are
computed by combining the four values at the rectangle corners (purple) (3.32). Positive values
are shown in bold and negative values in italics.

A potential disadvantage of summed area tables is that they require log M + log N extra bits 
in the accumulation image compared to the original image, where M and N are the image 
width and height. Extensions of summed area tables can also be used to approximate other 
convolution kernels (Wolberg (1990, Section 6.5.2) contains a review). 
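
The recursion (3.31) and the four-corner lookup (3.32) can be sketched in a few lines of plain Python. This is a minimal illustration rather than a tuned implementation: the function names, the list-of-lists image layout, and the extra leading row and column of zeros (used so that the i0 - 1 and j0 - 1 terms never fall outside the table) are all assumptions made here.

```python
def summed_area_table(f):
    """Build the summed area table of image f using the recursion (3.31).

    The table is padded with one leading row and column of zeros (an
    assumption of this sketch), so table[i + 1][j + 1] holds s(i, j).
    """
    rows, cols = len(f), len(f[0])
    s = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            s[i + 1][j + 1] = s[i][j + 1] + s[i + 1][j] - s[i][j] + f[i][j]
    return s


def area_sum(s, i0, i1, j0, j1):
    """Sum of f over the inclusive rectangle [i0, i1] x [j0, j1], as in (3.32)."""
    return s[i1 + 1][j1 + 1] - s[i1 + 1][j0] - s[i0][j1 + 1] + s[i0][j0]


image = [
    [3, 2, 7, 2, 3],
    [1, 5, 1, 3, 4],
    [5, 1, 3, 5, 1],
    [4, 3, 2, 1, 6],
    [2, 4, 1, 4, 8],
]
table = summed_area_table(image)
box = area_sum(table, 1, 3, 1, 2)  # rows 1..3, columns 1..2
```

Once the table is built, every rectangle sum costs four lookups regardless of the rectangle size, which is what makes repeated box filtering at many scales cheap.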

In computer vision, summed area tables have been used in face detection (Viola and 
Jones 2004) to compute simple multi-scale low-level features. Such features, which consist of 
adjacent rectangles of positive and negative values, are also known as boxlets (Simard, Bottou, 
Haffner et al. 1998). In principle, summed area tables could also be used to compute the sums 
in the sum of squared differences (SSD) stereo and motion algorithms (Section 11.4). In 
practice, separable moving average filters are usually preferred (Kanade, Yoshida, Oda et al. 
1996), unless many different window shapes and sizes are being considered (Veksler 2003). 

Recursive filtering 

The incremental formula (3.31) for the summed area is an example of a recursive filter, i.e., 
one whose values depend on previous filter outputs. In the signal processing literature, such 
filters are known as infinite impulse response (IIR), since the output of the filter to an impulse 
(single non-zero value) goes on forever. For example, for a summed area table, an impulse 
generates an infinite rectangle of 1s below and to the right of the impulse. The filters we have 
previously studied in this chapter, which convolve the image with a finite extent kernel, are 
known as finite impulse response (FIR). 

Two-dimensional IIR filters and recursive formulas are sometimes used to compute quan- 
tities that involve large area interactions, such as two-dimensional distance functions (Sec- 
tion 3.3.3) and connected components (Section 3.3.4). 

More commonly, however, IIR filters are used inside one-dimensional separable filtering 
stages to compute large-extent smoothing kernels, such as efficient approximations to Gaus- 
sians and edge filters (Deriche 1990; Nielsen, Florack, and Deriche 1997). Pyramid-based 
algorithms (Section 3.5) can also be used to perform such large-area smoothing computations. 
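
To make the IIR idea concrete, here is a small sketch of a first-order recursive smoother applied along one dimension. It is not the Deriche filter cited above; it is the simplest recursive filter we could pick for illustration, and the function names and the coefficient a are assumptions. A forward pass followed by a backward pass gives an (approximately) symmetric, large-extent response at a constant cost per sample.

```python
def recursive_smooth(x, a):
    """Causal first-order recursive filter: y[n] = a * x[n] + (1 - a) * y[n - 1]."""
    y, prev = [], 0.0
    for value in x:
        prev = a * value + (1.0 - a) * prev
        y.append(prev)
    return y


def symmetric_smooth(x, a):
    """Forward pass followed by a backward pass over the forward output."""
    forward = recursive_smooth(x, a)
    backward = recursive_smooth(forward[::-1], a)
    return backward[::-1]


impulse = [0.0] * 21
impulse[10] = 1.0

# The causal response decays geometrically to the right of the impulse
# but never becomes exactly zero (hence "infinite impulse response").
causal = recursive_smooth(impulse, 0.5)

# The two-pass response is symmetric about the impulse, up to the
# truncation caused by the finite signal length.
both = symmetric_smooth(impulse, 0.5)
```

A smaller value of a produces a wider effective kernel without any increase in the amount of computation, which is the main attraction of recursive smoothing.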


3.3 More neighborhood operators 

As we have just seen, linear filters can perform a wide variety of image transformations. 
However, non-linear filters, such as edge-preserving median or bilateral filters, can sometimes 
perform even better. Other examples of neighborhood operators include morphological operators 
that operate on binary images, as well as semi-global operators that compute distance 
transforms and find connected components in binary images (Figure 3.11f-h). 

3.3.1 Non-linear filtering 

The filters we have looked at so far have all been linear, i.e., their response to a sum of two 
signals is the same as the sum of the individual responses. This is equivalent to saying that 
each output pixel is a weighted summation of some number of input pixels (3.19). Linear 
filters are easier to compose and are amenable to frequency response analysis (Section 3.4). 

In many cases, however, better performance can be obtained by using a non-linear combination 
of neighboring pixels. Consider for example the image in Figure 3.18e, where the 
noise, rather than being Gaussian, is shot noise, i.e., it occasionally has very large values. In 
this case, regular blurring with a Gaussian filter fails to remove the noisy pixels and instead 
turns them into softer (but still visible) spots (Figure 3.18f). 

Figure 3.18 Median and bilateral filtering: (a) original image with Gaussian noise; (b) Gaussian 
filtered; (c) median filtered; (d) bilaterally filtered; (e) original image with shot noise; (f) 
Gaussian filtered; (g) median filtered; (h) bilaterally filtered. Note that the bilateral filter fails 
to remove the shot noise because the noisy pixels are too different from their neighbors. 

Figure 3.19 Median and bilateral filtering: (a) median pixel (green); (b) selected $\alpha$-trimmed 
mean pixels; (c) domain filter (numbers along edge are pixel distances); (d) range filter. 

Median filtering 

A better filter to use in this case is the median filter, which selects the median value from each 
pixel’s neighborhood (Figure 3.19a). Median values can be computed in expected linear time 
using a randomized select algorithm (Cormen 2001) and incremental variants have also been 
developed by Tomasi and Manduchi (1998) and Bovik (2000, Section 3.2). Since the shot 
noise value usually lies well outside the true values in the neighborhood, the median filter is 
able to filter away such bad pixels (Figure 3.18c). 

One downside of the median filter, in addition to its moderate computational cost, is that 
since it selects only one input pixel value to replace each output pixel, it is not as efficient at 
averaging away regular Gaussian noise (Huber 1981; Hampel, Ronchetti, Rousseeuw et al. 
1986; Stewart 1999). A better choice may be the $\alpha$-trimmed mean (Lee and Redner 1990) 
(Crane 1997, p. 109), which averages together all of the pixels except for the $\alpha$ fraction that 
are the smallest and the largest (Figure 3.19b). 
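
The following sketch compares the median filter and the $\alpha$-trimmed mean on a flat image corrupted by a single shot-noise pixel. It sorts each window rather than using a linear-time select, clips windows at the image border, and interprets $\alpha$ as the fraction dropped at each end of the sorted window; all three are assumptions of this illustration, as are the function names.

```python
import statistics


def neighborhood(f, i, j, radius=1):
    """Pixel values in the square window around (i, j), clipped at the borders."""
    rows, cols = len(f), len(f[0])
    return [f[k][l]
            for k in range(max(0, i - radius), min(rows, i + radius + 1))
            for l in range(max(0, j - radius), min(cols, j + radius + 1))]


def median_filter(f, radius=1):
    """Replace each pixel by the median of its neighborhood."""
    return [[statistics.median(neighborhood(f, i, j, radius))
             for j in range(len(f[0]))] for i in range(len(f))]


def trimmed_mean(values, alpha):
    """Mean after dropping the alpha fraction of smallest and of largest values.

    Assumes 0 <= alpha < 0.5 so that at least one value is kept.
    """
    ordered = sorted(values)
    drop = int(alpha * len(ordered))
    kept = ordered[drop:len(ordered) - drop]
    return sum(kept) / len(kept)


def trimmed_mean_filter(f, alpha, radius=1):
    """Replace each pixel by the alpha-trimmed mean of its neighborhood."""
    return [[trimmed_mean(neighborhood(f, i, j, radius), alpha)
             for j in range(len(f[0]))] for i in range(len(f))]


# A constant image with one shot-noise pixel in the middle.
noisy = [[10] * 5 for _ in range(5)]
noisy[2][2] = 255

median_result = median_filter(noisy)
trimmed_result = trimmed_mean_filter(noisy, alpha=0.2)

# For comparison: a plain 3 x 3 box average at the noisy pixel.
box_mean_center = sum(neighborhood(noisy, 2, 2)) / 9
```

Both non-linear filters discard the outlier completely, whereas the plain average only spreads it over the window.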

Another possibility is to compute a weighted median, in which each pixel is used a num- 
ber of times depending on its distance from the center. This turns out to be equivalent to 
minimizing the weighted objective function 

$$ \sum_{k,l} w(k,l) \, |f(i+k, j+l) - g(i,j)|^p, \tag{3.33} $$

where g(i,j) is the desired output value and p = 1 for the weighted median. The value p = 2 
is the usual weighted mean, which is equivalent to correlation (3.12) after normalizing by the 
sum of the weights (Bovik 2000, Section 3.2) (Haralick and Shapiro 1992, Section 7.2.6). 
The weighted mean also has deep connections to other methods in robust statistics (see 
Appendix B.3), such as influence functions (Huber 1981; Hampel, Ronchetti, Rousseeuw et al. 
1986). 
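
As a small numerical check of (3.33), the sketch below evaluates the objective for one neighborhood and compares the weighted median (p = 1) with the weighted mean (p = 2). The sample values, the weights, and the function names are made up for illustration; the weighted median is computed by sorting and accumulating weights until half of the total weight is reached, which is one standard way to obtain a minimizer for p = 1.

```python
def objective(values, weights, g, p):
    """The weighted objective (3.33) for a single output value g."""
    return sum(w * abs(v - g) ** p for v, w in zip(values, weights))


def weighted_mean(values, weights):
    """Minimizer of the objective for p = 2."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def weighted_median(values, weights):
    """A minimizer of the objective for p = 1.

    Returns the first value (in sorted order) at which the running sum of
    weights reaches half of the total weight.
    """
    half = 0.5 * sum(weights)
    running = 0.0
    for value, weight in sorted(zip(values, weights)):
        running += weight
        if running >= half:
            return value


# One neighborhood containing an outlier (250), with larger weights
# near the (hypothetical) center sample.
values = [10, 12, 11, 250, 9]
weights = [1, 2, 4, 2, 1]

g_median = weighted_median(values, weights)
g_mean = weighted_mean(values, weights)
```

The outlier drags the weighted mean far away from the bulk of the samples, while the weighted median stays with them, which is the robustness property discussed above.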

Non-linear smoothing has another, perhaps even more important property, especially 
since shot noise is rare in today’s cameras. Such filtering is more edge preserving, i.e., it 
has less tendency to soften edges while filtering away high-frequency noise. 

Consider the noisy image in Figure 3.18a. In order to remove most of the noise, the 
Gaussian filter is forced to smooth away high-frequency detail, which is most noticeable near 
strong edges. Median filtering does better but, as mentioned before, does not do as good 
a job at smoothing away from discontinuities. See (Tomasi and Manduchi 1998) for some 
additional references to edge-preserving smoothing techniques. 

While we could try to use the $\alpha$-trimmed mean or weighted median, these techniques still 
have a tendency to round sharp corners, since the majority of pixels in the smoothing area 
come from the background distribution. 


Bilateral filtering 


What if we were to combine the idea of a weighted filter kernel with a better version of outlier 
rejection? What if instead of rejecting a fixed percentage $\alpha$, we simply reject (in a soft way) 
pixels whose values differ too much from the central pixel value? This is the essential idea in 
bilateral filtering, which was first popularized in the computer vision community by Tomasi 
and Manduchi (1998). Chen, Paris, and Durand (2007) and Paris, Kornprobst, Tumblin et al. 
(2008) cite similar earlier work (Aurich and Weule 1995; Smith and Brady 1997) as well as 
the wealth of subsequent applications in computer vision and computational photography. 

In the bilateral filter, the output pixel value depends on a weighted combination of neighboring 
pixel values 

$$ g(i,j) = \frac{\sum_{k,l} f(k,l) \, w(i,j,k,l)}{\sum_{k,l} w(i,j,k,l)}. \tag{3.34} $$


The weighting coefficient $w(i,j,k,l)$ depends on the product of a domain kernel (Figure 3.19c), 

$$ d(i,j,k,l) = \exp\left( -\frac{(i-k)^2 + (j-l)^2}{2\sigma_d^2} \right), \tag{3.35} $$

and a data-dependent range kernel (Figure 3.19d), 

$$ r(i,j,k,l) = \exp\left( -\frac{\| f(i,j) - f(k,l) \|^2}{2\sigma_r^2} \right). \tag{3.36} $$


When multiplied together, these yield the data-dependent bilateral weight function 


$$ w(i,j,k,l) = \exp\left( -\frac{(i-k)^2 + (j-l)^2}{2\sigma_d^2} - \frac{\| f(i,j) - f(k,l) \|^2}{2\sigma_r^2} \right). \tag{3.37} $$


Figure 3.20 shows an example of the bilateral filtering of a noisy step edge. Note how the do- 
main kernel is the usual Gaussian, the range kernel measures appearance (intensity) similarity 
to the center pixel, and the bilateral filter kernel is a product of these two. 
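
A direct (brute-force) evaluation of (3.34) with the weight (3.37) might look as follows for a grayscale image. This is only a sketch under stated assumptions: the window is truncated to a square of the given radius and clipped at the borders, the image is a list of lists of numbers, and the parameter names sigma_d, sigma_r, and radius are chosen here for readability. None of the acceleration techniques mentioned in this section are used.

```python
import math


def bilateral_filter(f, sigma_d, sigma_r, radius):
    """Brute-force grayscale bilateral filter following (3.34) and (3.37)."""
    rows, cols = len(f), len(f[0])
    g = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            numerator = 0.0
            denominator = 0.0
            for k in range(max(0, i - radius), min(rows, i + radius + 1)):
                for l in range(max(0, j - radius), min(cols, j + radius + 1)):
                    domain = ((i - k) ** 2 + (j - l) ** 2) / (2.0 * sigma_d ** 2)
                    rng = (f[i][j] - f[k][l]) ** 2 / (2.0 * sigma_r ** 2)
                    w = math.exp(-domain - rng)  # bilateral weight (3.37)
                    numerator += f[k][l] * w
                    denominator += w
            g[i][j] = numerator / denominator
    return g


# A step edge (0 on the left, 100 on the right) with small deterministic
# "noise" in the range -2..2 added to every pixel.
step = [[(0 if j < 4 else 100) + ((7 * i + 3 * j) % 5 - 2) for j in range(8)]
        for i in range(6)]

smoothed = bilateral_filter(step, sigma_d=2.0, sigma_r=10.0, radius=2)
```

Because pixels across the edge differ by about 100 gray levels while sigma_r is 10, their range weights are vanishingly small, so each side of the edge is averaged almost independently and the step stays sharp.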

Notice that the range filter (3.36) uses the vector distance between the center and the 
neighboring pixel. This is important in color images, since an edge in any one of the color 
bands signals a change in material and hence the need to downweight a pixel’s influence. 5 

5 Tomasi and Manduchi (1998) show that using the vector distance (as opposed to filtering each color band 
separately) reduces color fringing effects. They also recommend taking the color difference in the more perceptually 
uniform CIELAB color space (see Section 2.3.2). 

Figure 3.20 Bilateral filtering (Durand and Dorsey 2002) © 2002 ACM: (a) noisy step 
edge input; (b) domain filter (Gaussian); (c) range filter (similarity to center pixel value); (d) 
bilateral filter; (e) filtered step edge output; (f) 3D distance between pixels. 

Since bilateral filtering is quite slow compared to regular separable filtering, a number 
of acceleration techniques have been developed (Durand and Dorsey 2002; Paris and Durand 
2006; Chen, Paris, and Durand 2007; Paris, Kornprobst, Tumblin et al. 2008). Unfortunately, 
these techniques tend to use more memory than regular filtering and are hence not directly 
applicable to filtering full-color images. 

Iterated adaptive smoothing and anisotropic diffusion 

Bilateral (and other) filters can also be applied in an iterative fashion, especially if an appear- 
ance more like a “cartoon” is desired (Tomasi and Manduchi 1998). When iterated filtering 
is applied, a much smaller neighborhood can often be used. 

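Consider, for example, using only the four nearest neighbors, i.e., restricting 
$|k - i| + |l - j| \le 1$ in (3.34). Observe that 

$$ d(i,j,k,l) = \exp\left( -\frac{(i-k)^2 + (j-l)^2}{2\sigma_d^2} \right) \tag{3.38} $$

$$ = \begin{cases} 1, & |k-i| + |l-j| = 0, \\ \eta = e^{-1/(2\sigma_d^2)}, & |k-i| + |l-j| = 1. \end{cases} \tag{3.39} $$

We can thus re-write (3.34) as 

$$ f^{(t+1)}(i,j) = \frac{f^{(t)}(i,j) + \eta \sum_{k,l} f^{(t)}(k,l) \, r(i,j,k,l)}{1 + \eta \sum_{k,l} r(i,j,k,l)} = f^{(t)}(i,j) + \frac{\eta}{1 + \eta R} \sum_{k,l} r(i,j,k,l) \left[ f^{(t)}(k,l) - f^{(t)}(i,j) \right], \tag{3.40} $$

where $R = \sum_{(k,l)} r(i,j,k,l)$, $(k,l)$ are the $\mathcal{N}_4$ neighbors of $(i,j)$, and we have made the 
iterative nature of the filtering explicit. 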

As Barash (2002) notes, (3.40) is the same as the discrete anisotropic diffusion equation 
first proposed by Perona and Malik (1990b). 6 Since its original introduction, anisotropic dif- 
fusion has been extended and applied to a wide range of problems (Nielsen, Florack, and De- 
riche 1997; Black, Sapiro, Marimont et al. 1998; Weickert, ter Haar Romeny, and Viergever 
1998; Weickert 1998). It has also been shown to be closely related to other adaptive smooth- 
ing techniques (Saint-Marc, Chen, and Medioni 1991; Barash 2002; Barash and Comaniciu 
2004) as well as Bayesian regularization with a non-linear smoothness term that can be de- 
rived from image statistics (Scharr, Black, and Haussecker 2003). 

In its general form, the range kernel $r(i,j,k,l) = r(\| f(i,j) - f(k,l) \|)$, which is usually 
called the gain or edge-stopping function, or diffusion coefficient, can be any monotonically 
increasing function with $r'(x) \to 0$ as $x \to \infty$. Black, Sapiro, Marimont et al. (1998) show 
how anisotropic diffusion is equivalent to minimizing a robust penalty function on the image 
gradients, which we discuss in Sections 3.7.1 and 3.7.2. Scharr, Black, and Haussecker 
(2003) show how the edge-stopping function can be derived in a principled manner from 
local image statistics. They also extend the diffusion neighborhood from $\mathcal{N}_4$ to $\mathcal{N}_8$, which 
allows them to create a diffusion operator that is both rotationally invariant and incorporates 
information about the eigenvalues of the local structure tensor. 

Note that, without a bias term towards the original image, anisotropic diffusion and itera- 
tive adaptive smoothing converge to a constant image. Unless a small number of iterations is 
used (e.g., for speed), it is usually preferable to formulate the smoothing problem as a joint 
minimization of a smoothness term and a data fidelity term, as discussed in Sections 3.7.1 
and 3.7.2 and by Scharr, Black, and Haussecker (2003), which introduce such a bias in a 
principled manner. 
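
One iteration of the four-neighbor update in (3.40) can be sketched as below, using the Gaussian range kernel (3.36) as the edge-stopping function. The step size eta, the value of sigma_r, the number of iterations, and the treatment of border pixels (missing neighbors are simply skipped) are assumptions made for this illustration, and no bias term towards the original image is included.

```python
import math


def adaptive_smoothing_step(f, eta, sigma_r):
    """One iteration of (3.40) over the four nearest neighbors of each pixel."""
    rows, cols = len(f), len(f[0])
    out = [row[:] for row in f]
    for i in range(rows):
        for j in range(cols):
            weighted_diff = 0.0  # sum of r * (neighbor - center)
            R = 0.0              # sum of the range weights r
            for k, l in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                if 0 <= k < rows and 0 <= l < cols:
                    diff = f[k][l] - f[i][j]
                    r = math.exp(-diff * diff / (2.0 * sigma_r ** 2))
                    weighted_diff += r * diff
                    R += r
            out[i][j] = f[i][j] + eta / (1.0 + eta * R) * weighted_diff
    return out


# Noisy step edge: 0 on the left, 100 on the right, noise in -2..2.
image = [[(0 if j < 4 else 100) + ((7 * i + 3 * j) % 5 - 2) for j in range(8)]
         for i in range(6)]

result = image
for _ in range(20):
    result = adaptive_smoothing_step(result, eta=0.5, sigma_r=10.0)
```

After a modest number of iterations the noise inside each flat region is reduced while the step survives. Running the loop for very many iterations would eventually flatten the whole image, as noted in the text.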

6 The $1/(1 + \eta R)$ factor is not present in anisotropic diffusion but becomes negligible as $\eta \to 0$. 

3.3.2 Morphology 

While non-linear filters are often used to enhance grayscale and color images, they are also 
used extensively to process binary images. Such images often occur after a thresholding 
operation, 

$$ \theta(f,t) = \begin{cases} 1 & \text{if } f \ge t, \\ 0 & \text{else,} \end{cases} \tag{3.41} $$

e.g., converting a scanned grayscale document into a binary image for further processing such 
as optical character recognition. 

Figure 3.21 Binary image morphology: (a) original image; (b) dilation; (c) erosion; (d) 
majority; (e) opening; (f) closing. The structuring element for all examples is a 5 x 5 square. 
The effects of majority are a subtle rounding of sharp corners. Opening fails to eliminate the 
dot, since it is not wide enough. 

The most common binary image operations are called morphological operations, since 
they change the shape of the underlying binary objects (Ritter and Wilson 2000, Chapter 7). 
To perform such an operation, we first convolve the binary image with a binary structuring 
element and then select a binary output value depending on the thresholded result of the 
convolution. (This is not the usual way in which these operations are described, but I find it 
a nice simple way to unify the processes.) The structuring element can be any shape, from 
a simple 3x3 box filter, to more complicated disc structures. It can even correspond to a 
particular shape that is being sought for in the image. 

Figure 3.21 shows a close-up of the convolution of a binary image $f$ with a 3 x 3 structuring 
element $s$ and the resulting images for the operations described below. Let 

$$ c = f \otimes s \tag{3.42} $$

be the integer-valued count of the number of 1s inside each structuring element as it is scanned 
over the image and $S$ be the size of the structuring element (number of pixels). The standard 
operations used in binary morphology include: 

• dilation: $\mathrm{dilate}(f,s) = \theta(c, 1)$; 

• erosion: $\mathrm{erode}(f,s) = \theta(c, S)$; 

• majority: $\mathrm{maj}(f,s) = \theta(c, S/2)$; 

• opening: $\mathrm{open}(f,s) = \mathrm{dilate}(\mathrm{erode}(f,s), s)$; 

• closing: $\mathrm{close}(f,s) = \mathrm{erode}(\mathrm{dilate}(f,s), s)$. 

As we can see from Figure 3.21, dilation grows (thickens) objects consisting of 1s, while 
erosion shrinks (thins) them. The opening and closing operations tend to leave large regions 
and smooth boundaries unaffected, while removing small objects or holes and smoothing 
boundaries. 
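
The count-then-threshold view of morphology given above translates almost line for line into code. The sketch below assumes a square box structuring element of odd width, treats pixels outside the image as 0, and uses function names of our own choosing.

```python
def count_ones(f, size=3):
    """c = f (x) s for a size x size box; pixels outside the image count as 0."""
    rows, cols = len(f), len(f[0])
    half = size // 2
    return [[sum(f[k][l]
                 for k in range(i - half, i + half + 1)
                 for l in range(j - half, j + half + 1)
                 if 0 <= k < rows and 0 <= l < cols)
             for j in range(cols)] for i in range(rows)]


def theta(c, t):
    """Thresholding operation (3.41) applied to every pixel."""
    return [[1 if value >= t else 0 for value in row] for row in c]


def dilate(f, size=3):
    return theta(count_ones(f, size), 1)


def erode(f, size=3):
    return theta(count_ones(f, size), size * size)


def majority(f, size=3):
    return theta(count_ones(f, size), size * size / 2)


def opening(f, size=3):
    return dilate(erode(f, size), size)


def closing(f, size=3):
    return erode(dilate(f, size), size)


# An 8 x 8 test image: a 3 x 3 block of 1s plus one isolated dot.
binary = [[0] * 8 for _ in range(8)]
for i in range(1, 4):
    for j in range(1, 4):
        binary[i][j] = 1
binary[6][6] = 1

opened = opening(binary)  # the isolated dot is removed, the block survives
closed = closing(binary)  # nothing to fill here, so the image is unchanged
```

In this example the dot is narrower than the 3 x 3 structuring element, so opening removes it, whereas in Figure 3.21 the dot is reported to survive opening.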

While we will not use mathematical morphology much in the rest of this book, it is a 
handy tool to have around whenever you need to clean up some thresholded images. You 
can find additional details on morphology in other textbooks on computer vision and image 
processing (Haralick and Shapiro 1992, Section 5.2) (Bovik 2000, Section 2.2) (Ritter and 
Wilson 2000, Section 7) as well as articles and books specifically on this topic (Serra 1982; 
Serra and Vincent 1992; Yuille, Vincent, and Geiger 1992; Soille 2006). 

3.3.3 Distance transforms 

The distance transform is useful in quickly precomputing the distance to a curve or set of 
points using a two-pass raster algorithm (Rosenfeld and Pfaltz 1966; Danielsson 1980; Borge- 
fors 1986; Paglieroni 1992; Breu, Gil, Kirkpatrick et al. 1995; Felzenszwalb and Huttenlocher 
2004a; Fabbri, Costa, Torelli et al. 2008). It has many applications, including level sets (Sec- 
tion 5.1.4), fast chamfer matching (binary image alignment) (Huttenlocher, Klanderman, and 
Rucklidge 1993), feathering in image stitching and blending (Section 9.3.2), and nearest point 
alignment (Section 12.2.1). 

The distance transform $D(i,j)$ of a binary image $b(i,j)$ is defined as follows. Let $d(k,l)$ 
be some distance metric between pixel offsets. Two commonly used metrics include the city 
block or Manhattan distance 

$$ d_1(k,l) = |k| + |l| \tag{3.43} $$

and the Euclidean distance 

$$ d_2(k,l) = \sqrt{k^2 + l^2}. \tag{3.44} $$

The distance transform is then defined as 

$$ D(i,j) = \min_{k,l:\, b(k,l)=0} d(i-k, j-l), \tag{3.45} $$

i.e., it is the distance to the nearest background pixel whose value is 0. 

The $D_1$ city block distance transform can be efficiently computed using a forward and 
backward pass of a simple raster-scan algorithm, as shown in Figure 3.22. During the forward 
pass, each non-zero pixel in b is replaced by the minimum of 1 + the distance of its north or 
west neighbor. During the backward pass, the same occurs, except that the minimum is both 
over the current value D and 1 + the distance of the south and east neighbors (Figure 3.22). 

Figure 3.22 City block distance transform: (a) original binary image; (b) top to bottom 
(forward) raster sweep: green values are used to compute the orange value; (c) bottom to top 
(backward) raster sweep: green values are merged with old orange value; (d) final distance 
transform. 
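
The two-pass city block algorithm described above can be sketched as follows. The handling of the image border (neighbors outside the image are ignored) and the use of rows + cols as a stand-in for "infinity" are assumptions of this sketch. If the image contains no zero pixel at all, every output simply stays at a value of at least that stand-in.

```python
def city_block_distance_transform(b):
    """Two-pass raster-scan computation of the city block distance transform."""
    rows, cols = len(b), len(b[0])
    big = rows + cols  # exceeds any city block distance inside the image
    D = [[0 if b[i][j] == 0 else big for j in range(cols)] for i in range(rows)]

    # Forward pass (top to bottom, left to right): north and west neighbors.
    for i in range(rows):
        for j in range(cols):
            if D[i][j] != 0:
                if i > 0:
                    D[i][j] = min(D[i][j], D[i - 1][j] + 1)
                if j > 0:
                    D[i][j] = min(D[i][j], D[i][j - 1] + 1)

    # Backward pass (bottom to top, right to left): south and east
    # neighbors, merged with the value from the forward pass.
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if D[i][j] != 0:
                if i < rows - 1:
                    D[i][j] = min(D[i][j], D[i + 1][j] + 1)
                if j < cols - 1:
                    D[i][j] = min(D[i][j], D[i][j + 1] + 1)
    return D


binary = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
distances = city_block_distance_transform(binary)
```

For this small example the result can be checked against a brute-force evaluation of (3.45) with the $d_1$ metric, which requires a search over all background pixels for every foreground pixel.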


Efficiently computing the Euclidean distance transform is more complicated. Here, just 
keeping the minimum scalar distance to the boundary during the two passes is not sufficient. 
Instead, a vector-valued distance consisting of both the x and y coordinates of the distance 
to the boundary must be kept and compared using the squared distance (hypotenuse) rule. As 
well, larger search regions need to be used to obtain reasonable results. Rather than explaining 
the algorithm (Danielsson 1980; Borgefors 1986) in more detail, we leave it as an exercise 
for the motivated reader (Exercise 3.13). 

Figure 3.11g shows a distance transform computed from a binary image. Notice how 
the values grow away from the black (ink) regions and form ridges in the white area of the 
original image. Because of this linear growth from the starting boundary pixels, the distance 
transform is also sometimes known as the grassfire transform, since it describes the time at 
which a fire starting inside the black region would consume any given pixel, or a chamfer, 
because it resembles similar shapes used in woodworking and industrial design. The ridges 
in the distance transform become the skeleton (or medial axis transform (MAT)) of the region 
where the transform is computed, and consist of pixels that are of equal distance to two (or 
more) boundaries (Tek and Kimia 2003; Sebastian and Kimia 2005). 

A useful extension of the basic distance transform is the signed distance transform, which 
computes distances to boundary pixels for all the pixels (Lavallee and Szeliski 1995). The 
simplest way to create this is to compute the distance transforms for both the original bi- 
nary image and its complement and to negate one of them before combining. Because such 
distance fields tend to be smooth, it is possible to store them more compactly (with mini- 
mal loss in relative accuracy) using a spline defined over a quadtree or octree data structure 
(Lavallee and Szeliski 1995; Szeliski and Lavallee 1996; Frisken, Perry, Rockwood et al. 
2000). Such precomputed signed distance transforms can be extremely useful in efficiently 
aligning and merging 2D curves and 3D surfaces (Huttenlocher, Klanderman, and Rucklidge 
1993; Szeliski and Lavallee 1996; Curless and Levoy 1996), especially if the vectorial version 
of the distance transform, i.e., a pointer from each pixel or voxel to the nearest boundary or 
surface element, is stored and interpolated. Signed distance fields are also an essential component 
of level set evolution (Section 5.1.4), where they are called characteristic functions. 

Figure 3.23 Connected component computation: (a) original grayscale image; (b) horizontal 
runs (nodes) connected by vertical (graph) edges (dashed blue), with runs pseudocolored with 
unique colors inherited from parent nodes; (c) re-coloring after merging adjacent segments. 


3.3.4 Connected components 

Another useful semi-global image operation is finding connected components, which are de- 
fined as regions of adjacent pixels that have the same input value (or label). (In the remainder 
of this section, consider pixels to be adjacent if they are immediate $\mathcal{N}_4$ neighbors and they 
have the same input value.) Connected components can be used in a variety of applications, 
such as finding individual letters in a scanned document or finding objects (say, cells) in a 
thresholded image and computing their area statistics. 

Consider the grayscale image in Figure 3.23a. There are four connected components in 
this figure: the outermost set of white pixels, the large ring of gray pixels, the white enclosed 
region, and the single gray pixel. These are shown pseudocolored in Figure 3.23c as pink, 
green, blue, and brown. 

To compute the connected components of an image, we first (conceptually) split the image 
into horizontal runs of adjacent pixels, and then color the runs with unique labels, re-using 
the labels of vertically adjacent runs whenever possible. In a second phase, adjacent runs of 
different colors are then merged. 

While this description is a little sketchy, it should be enough to enable a motivated stu- 
dent to implement this algorithm (Exercise 3.14). Haralick and Shapiro (1992, Section 2.3) 
give a much longer description of various connected component algorithms, including ones 
that avoid the creation of a potentially large re-coloring (equivalence) table. Well-debugged 
connected component algorithms are also available in most image processing libraries. 
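
As one possible starting point, here is a compact two-pass labeling sketch. Note that it is a pixel-based variant with a union-find equivalence table rather than the run-based formulation outlined above; the two share the same structure (provisional labels in a first raster pass, merging of equivalent labels in a second), but the details and all names here are our own assumptions.

```python
def connected_components(image):
    """Label N4-connected regions of equal value; returns (labels, count)."""
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    parent = []  # union-find forest acting as the equivalence table

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    # First pass: assign provisional labels, re-using the labels of the
    # north and west neighbors when they have the same input value.
    for i in range(rows):
        for j in range(cols):
            roots = []
            if i > 0 and image[i - 1][j] == image[i][j]:
                roots.append(find(labels[i - 1][j]))
            if j > 0 and image[i][j - 1] == image[i][j]:
                roots.append(find(labels[i][j - 1]))
            if not roots:
                parent.append(len(parent))
                labels[i][j] = len(parent) - 1
            else:
                root = min(roots)
                for other in roots:
                    parent[other] = root  # record the equivalence
                labels[i][j] = root

    # Second pass: replace each provisional label by a compact final label.
    final = {}
    for i in range(rows):
        for j in range(cols):
            root = find(labels[i][j])
            labels[i][j] = final.setdefault(root, len(final))
    return labels, len(final)


# 0 = white, 1 = gray: an outer white region, a gray ring, an enclosed
# white region, and a single gray pixel in the middle.
picture = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
labels, count = connected_components(picture)
```

The test picture is a simplified stand-in with the same four kinds of regions as the example discussed for Figure 3.23, not a copy of that figure.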

Once a binary or multi-valued image has been segmented into its connected components, 
it is often useful to compute the area statistics for each individual region $\mathcal{R}$. Such statistics 
include: 

• the area (number of pixels); 

• the perimeter (number of boundary pixels); 

• the centroid (average x and y values); 

• the second moments, 

$$ M = \sum_{(x,y) \in \mathcal{R}} \begin{bmatrix} x - \bar{x} \\ y - \bar{y} \end{bmatrix} \begin{bmatrix} x - \bar{x} & y - \bar{y} \end{bmatrix}, \tag{3.46} $$

from which the major and minor axis orientation and lengths can be computed using 
eigenvalue analysis. 7 


These statistics can then be used for further processing, e.g., for sorting the regions by the area 
size (to consider the largest regions first) or for preliminary matching of regions in different 
images. 
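
The statistics listed above (apart from the perimeter, which is omitted here) can be computed from a list of pixel coordinates as sketched below. The closed-form eigenvalue analysis of the 2 x 2 matrix in (3.46), the returned dictionary layout, and the function name are assumptions of this example; axis lengths would be proportional to the square roots of the eigenvalues.

```python
import math


def region_statistics(pixels):
    """Area, centroid, and second-moment analysis for a list of (x, y) pixels."""
    area = len(pixels)
    x_bar = sum(x for x, _ in pixels) / area
    y_bar = sum(y for _, y in pixels) / area

    # Entries of the symmetric second moment matrix M in (3.46).
    m_xx = sum((x - x_bar) ** 2 for x, _ in pixels)
    m_xy = sum((x - x_bar) * (y - y_bar) for x, y in pixels)
    m_yy = sum((y - y_bar) ** 2 for _, y in pixels)

    # Closed-form eigenvalues of [[m_xx, m_xy], [m_xy, m_yy]].
    mean = 0.5 * (m_xx + m_yy)
    spread = math.hypot(0.5 * (m_xx - m_yy), m_xy)
    # Orientation (radians) of the eigenvector with the larger eigenvalue.
    orientation = 0.5 * math.atan2(2.0 * m_xy, m_xx - m_yy)

    return {
        "area": area,
        "centroid": (x_bar, y_bar),
        "major_eigenvalue": mean + spread,
        "minor_eigenvalue": mean - spread,
        "orientation": orientation,
    }


# A horizontal 6 x 2 bar of pixels.
bar = [(x, y) for x in range(6) for y in range(2)]
stats = region_statistics(bar)
```

For the horizontal bar, the major axis comes out aligned with the x axis, as expected.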


3.4 Fourier transforms 

In Section 3.2, we mentioned that Fourier analysis could be used to analyze the frequency 
characteristics of various filters. In this section, we explain both how Fourier analysis lets us 
determine these characteristics (or equivalently, the frequency content of an image) and how 
using the Fast Fourier Transform (FFT) lets us perform large-kernel convolutions in time that 
is independent of the kernel’s size. More comprehensive introductions to Fourier transforms 
are provided by Bracewell (1986); Glassner (1995); Oppenheim and Schafer (1996); Oppen- 
heim, Schafer, and Buck (1999). 

How can we analyze what a given filter does to high, medium, and low frequencies? The 
answer is to simply pass a sinusoid of known frequency through the filter and to observe by 
how much it is attenuated. Let 

$$ s(x) = \sin(2\pi f x + \phi_i) = \sin(\omega x + \phi_i) \tag{3.47} $$

be the input sinusoid whose frequency is $f$, angular frequency is $\omega = 2\pi f$, and phase is $\phi_i$. 
Note that in this section, we use the variables x and y to denote the spatial coordinates of an 
image, rather than i and j as in the previous sections. This is both because the letters i and j 
are used for the imaginary number (the usage depends on whether you are reading complex 
variables or electrical engineering literature) and because it is clearer how to distinguish the 
horizontal (x) and vertical (y) components in frequency space. In this section, we use the 
letter j for the imaginary number, since that is the form more commonly found in the signal 
processing literature (Bracewell 1986; Oppenheim and Schafer 1996; Oppenheim, Schafer, 
and Buck 1999). 

7 Moments can also be computed using Green’s theorem applied to the boundary pixels (Yang and Albregtsen 
1996). 

Figure 3.24 The Fourier Transform as the response of a filter $h(x)$ to an input sinusoid 
$s(x) = e^{j\omega x}$ yielding an output sinusoid $o(x) = h(x) * s(x) = A e^{j(\omega x + \phi)}$. 

If we convolve the sinusoidal signal $s(x)$ with a filter whose impulse response is $h(x)$, 
we get another sinusoid of the same frequency but different magnitude $A$ and phase $\phi_o$, 

$$ o(x) = h(x) * s(x) = A \sin(\omega x + \phi_o), \tag{3.48} $$

as shown in Figure 3.24. To see that this is the case, remember that a convolution can be 
expressed as a weighted summation of shifted input signals (3.14) and that the summation of 
a bunch of shifted sinusoids of the same frequency is just a single sinusoid at that frequency. 8 
The new magnitude $A$ is called the gain or magnitude of the filter, while the phase difference 
$\Delta\phi = \phi_o - \phi_i$ is called the shift or phase. 

In fact, a more compact notation is to use the complex-valued sinusoid 

$$ s(x) = e^{j\omega x} = \cos \omega x + j \sin \omega x. \tag{3.49} $$

In that case, we can simply write, 

$$ o(x) = h(x) * s(x) = A e^{j(\omega x + \phi)}. \tag{3.50} $$
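
The claim behind (3.50) can be checked numerically: convolving a sampled complex sinusoid with a short filter gives back the same sinusoid multiplied by one complex constant, whose magnitude is the gain and whose angle is the phase shift. In this sketch the three-tap filter, the frequency, and the causal indexing of the taps (h[0] multiplies the current sample) are all choices made for illustration.

```python
import cmath
import math


def convolve_at(h, s, x):
    """Discrete convolution output at position x: sum over k of h[k] * s(x - k)."""
    return sum(h[k] * s(x - k) for k in range(len(h)))


omega = math.pi / 2        # angular frequency in radians per sample
h = [0.25, 0.5, 0.25]      # a small smoothing filter, indexed causally


def s(x):
    """Complex sinusoid of angular frequency omega, as in (3.49)."""
    return cmath.exp(1j * omega * x)


# The ratio output / input does not depend on where we measure it.
ratio_a = convolve_at(h, s, 10) / s(10)
ratio_b = convolve_at(h, s, 37) / s(37)

gain = abs(ratio_a)            # the magnitude A
phase = cmath.phase(ratio_a)   # the phase shift, in radians
```

For this filter and frequency the gain is 0.5 and the phase shift is a quarter period, the latter coming purely from the one-sample delay implied by the causal indexing.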

8 If h is a general (non-linear) transform, additional harmonic frequencies are introduced. This was traditionally 
the bane of audiophiles, who insisted on equipment with no harmonic distortion. Now that digital audio has intro- 
duced pure distortion-free sound, some audiophiles are buying retro tube amplifiers or digital signal processors that 
simulate such distortions because of their “warmer sound”. 

The Fourier transform is simply a tabulation of the magnitude and phase response at each 
frequency, 

$$ H(\omega) = \mathcal{F}\{h(x)\} = A e^{j\phi}, \tag{3.51} $$

i.e., it is the response to a complex sinusoid of frequency $\omega$ passed through the filter $h(x)$. 
The Fourier transform pair is also often written as 

$$ h(x) \overset{\mathcal{F}}{\leftrightarrow} H(\omega). \tag{3.52} $$

Unfortunately, (3.51) does not give an actual formula for computing the Fourier transform. 
Instead, it gives a recipe, i.e., convolve the filter with a sinusoid, observe the magnitude and 
phase shift, repeat. Fortunately, closed form equations for the Fourier transform exist both in 
the continuous domain, 

$$ H(\omega) = \int_{-\infty}^{\infty} h(x) e^{-j\omega x} \, dx, \tag{3.53} $$

and in the discrete domain, 

$$ H(k) = \frac{1}{N} \sum_{x=0}^{N-1} h(x) e^{-j \frac{2\pi k x}{N}}, \tag{3.54} $$

where N is the length of the signal or region of analysis. These formulas apply both to filters, 
such as h(x), and to signals or images, such as s(x) or g(x). 

The discrete form of the Fourier transform (3.54) is known as the Discrete Fourier Transform 
(DFT). Note that while (3.54) can be evaluated for any value of k, it only makes sense 
for values in the range $k \in [-\frac{N}{2}, \frac{N}{2}]$. This is because larger values of k alias with lower 
frequencies and hence provide no additional information, as explained in the discussion on 
aliasing in Section 2.3.1. 

At face value, the DFT takes $O(N^2)$ operations (multiply-adds) to evaluate. Fortunately, 
there exists a faster algorithm called the Fast Fourier Transform (FFT), which requires only 
$O(N \log_2 N)$ operations (Bracewell 1986; Oppenheim, Schafer, and Buck 1999). We do not 
explain the details of the algorithm here, except to say that it involves a series of $\log_2 N$ 
stages, where each stage performs small 2 x 2 transforms (matrix multiplications with known 
coefficients) followed by some semi-global permutations. (You will often see the term but- 
terfly applied to these stages because of the pictorial shape of the signal processing graphs 
involved.) Implementations for the FFT can be found in most numerical and signal processing 
libraries. 
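
For readers who want to experiment, the sketch below evaluates (3.54) directly and compares it with a textbook recursive radix-2 FFT. The recursive decimation-in-time form shown here is just one simple variant, chosen for brevity; it is not necessarily how library implementations are organized, it only handles lengths that are powers of two, and it keeps the 1/N normalization of (3.54), which many libraries place on the inverse transform instead.

```python
import cmath


def dft(h):
    """Direct O(N^2) evaluation of (3.54), including its 1/N factor."""
    N = len(h)
    return [sum(h[x] * cmath.exp(-2j * cmath.pi * k * x / N) for x in range(N)) / N
            for k in range(N)]


def fft(h):
    """Recursive radix-2 FFT with the same 1/N scaling as dft().

    The length of h must be a power of two.
    """
    N = len(h)
    if N == 1:
        return [complex(h[0])]
    even = fft(h[0::2])  # transform of the even-indexed samples
    odd = fft(h[1::2])   # transform of the odd-indexed samples
    out = [0j] * N
    for k in range(N // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / N) * odd[k]
        # One 2 x 2 "butterfly"; the division by 2 maintains the 1/N scaling.
        out[k] = (even[k] + twiddle) / 2
        out[k + N // 2] = (even[k] - twiddle) / 2
    return out


signal = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # a short box
slow = dft(signal)
fast = fft(signal)
```

Both routines return the same coefficients up to floating-point error; only the operation count differs.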

Now that we have defined the Fourier transform, what are some of its properties and how 
can they be used? Table 3.1 lists a number of useful properties, which we describe in a little 
more detail below: 

Property             Signal                  Transform 

superposition        $f_1(x) + f_2(x)$       $F_1(\omega) + F_2(\omega)$ 
shift                $f(x - x_0)$            $F(\omega) e^{-j\omega x_0}$ 
reversal             $f(-x)$                 $F^*(\omega)$ 
convolution          $f(x) * h(x)$           $F(\omega) H(\omega)$ 
correlation          $f(x) \otimes h(x)$     $F(\omega) H^*(\omega)$ 
multiplication       $f(x) h(x)$             $F(\omega) * H(\omega)$ 
differentiation      $f'(x)$                 $j\omega F(\omega)$ 
domain scaling       $f(ax)$                 $\frac{1}{a} F(\omega/a)$ 
real images          $f(x) = f^*(x)$         $F(\omega) = F(-\omega)$ 
Parseval’s Theorem   $\sum_x [f(x)]^2$       $= \sum_\omega [F(\omega)]^2$ 

Table 3.1 Some useful properties of Fourier transforms. The original transform pair is 
$F(\omega) = \mathcal{F}\{f(x)\}$. 


• Superposition: The Fourier transform of a sum of signals is the sum of their Fourier 
transforms. Thus, the Fourier transform is a linear operator. 

• Shift: The Fourier transform of a shifted signal is the transform of the original signal 
multiplied by a linear phase shift (complex sinusoid). 

• Reversal: The Fourier transform of a reversed signal is the complex conjugate of the 
signal’s transform. 

• Convolution: The Fourier transform of a pair of convolved signals is the product of 
their transforms. 

• Correlation: The Fourier transform of a correlation is the product of the first transform 
times the complex conjugate of the second one. 

• Multiplication: The Fourier transform of the product of two signals is the convolution 
of their transforms. 

• Differentiation: The Fourier transform of the derivative of a signal is that signal’s 
transform multiplied by the frequency. In other words, differentiation linearly empha- 
sizes (magnifies) higher frequencies. 

• Domain scaling: The Fourier transform of a stretched signal is the equivalently compressed 
(and scaled) version of the original transform and vice versa. 

• Real images: The Fourier transform of a real-valued signal is symmetric around the 
origin. This fact can be used to save space and to double the speed of image FFTs 
by packing alternating scanlines into the real and imaginary parts of the signal being 
transformed. 

• Parseval’s Theorem: The energy (sum of squared values) of a signal is the same as 
the energy of its Fourier transform. 

All of these properties are relatively straightforward to prove (see Exercise 3.15) and they will 
come in handy later in the book, e.g., when designing optimum Wiener filters (Section 3.4.3) 
or performing fast image correlations (Section 8.1.2). 

3.4.1 Fourier transform pairs 

Now that we have these properties in place, let us look at the Fourier transform pairs of some 
commonly occurring filters and signals, as listed in Table 3.2. In more detail, these pairs are 
as follows: 

• Impulse: The impulse response has a constant (all frequency) transform. 

• Shifted impulse: The shifted impulse has unit magnitude and linear phase. 

• Box filter: The box (moving average) filter 

$$ \mathrm{box}(x) = \begin{cases} 1 & \text{if } |x| \le 1, \\ 0 & \text{else} \end{cases} \tag{3.55} $$

has a sinc Fourier transform, 

$$ \mathrm{sinc}(\omega) = \frac{\sin \omega}{\omega}, \tag{3.56} $$

which has an infinite number of side lobes. Conversely, the sinc filter is an ideal low-pass 
filter. For a non-unit box, the width of the box $a$ and the spacing of the zero 
crossings in the sinc $1/a$ are inversely proportional. 

• Tent: The piecewise linear tent function, 

$$ \mathrm{tent}(x) = \max(0, 1 - |x|), \tag{3.57} $$

has a $\mathrm{sinc}^2$ Fourier transform. 


• Gaussian: The (unit area) Gaussian of width a, 



(3.58) 


has a (unit height) Gaussian of width a 1 as its Fourier transform. 
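Name                    Signal                                                           Transform

impulse                 $\delta(x)$                                                      $1$
shifted impulse         $\delta(x - u)$                                                  $e^{-j\omega u}$
box filter              $\mathrm{box}(x/a)$                                              $a\,\mathrm{sinc}(a\omega)$
tent                    $\mathrm{tent}(x/a)$                                             $a\,\mathrm{sinc}^2(a\omega)$
Gaussian                $G(x; \sigma)$                                                   $\frac{\sqrt{2\pi}}{\sigma} G(\omega; \sigma^{-1})$
Laplacian of Gaussian   $(\frac{x^2}{\sigma^4} - \frac{1}{\sigma^2})\, G(x; \sigma)$     $-\frac{\sqrt{2\pi}}{\sigma} \omega^2 G(\omega; \sigma^{-1})$
Gabor                   $\cos(\omega_0 x)\, G(x; \sigma)$                                $\frac{\sqrt{2\pi}}{\sigma} G(\omega \pm \omega_0; \sigma^{-1})$
unsharp mask            $(1 + \gamma)\delta(x) - \gamma G(x; \sigma)$                    $(1 + \gamma) - \frac{\sqrt{2\pi}\gamma}{\sigma} G(\omega; \sigma^{-1})$
windowed sinc           $\mathrm{rcos}(x/(aW))\, \mathrm{sinc}(x/a)$                     (see Figure 3.29)
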


Table 3.2 Some useful (continuous) Fourier transform pairs: The dashed line in the Fourier 
transform of the shifted impulse indicates its (linear) phase. All other transforms have zero 
phase (they are real-valued). Note that the figures are not necessarily drawn to scale but 
are drawn to illustrate the general shape and characteristics of the filter or its response. In 
particular, the Laplacian of Gaussian is drawn inverted because it resembles more a “Mexican 
hat”, as it is sometimes called. 
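• Laplacian of Gaussian: The second derivative of a Gaussian of width $\sigma$,

$$ \mathrm{LoG}(x; \sigma) = \left(\frac{x^2}{\sigma^4} - \frac{1}{\sigma^2}\right) G(x; \sigma) \tag{3.59} $$

has a band-pass response of

$$ -\frac{\sqrt{2\pi}}{\sigma}\, \omega^2\, G(\omega; \sigma^{-1}) \tag{3.60} $$

as its Fourier transform.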

• Gabor: The even Gabor function, which is the product of a cosine of frequency $\omega_0$ and
a Gaussian of width $\sigma$, has as its transform the sum of the two Gaussians of width
$\sigma^{-1}$ centered at $\omega = \pm\omega_0$. The odd Gabor function, which uses a sine, is
the difference of two such Gaussians. Gabor functions are often used for oriented and
band-pass filtering, since they can be more frequency selective than Gaussian derivatives.

• Unsharp mask: The unsharp mask introduced in (3.22) has as its transform a unit 
response with a slight boost at higher frequencies. 

• Windowed sinc: The windowed (masked) sinc function shown in Table 3.2 has a response
function that approximates an ideal low-pass filter better and better as additional
side lobes are added ($W$ is increased). Figure 3.29 shows the shapes of these filters
along with their Fourier transforms. For these examples, we use a one-lobe raised cosine,

$$ \mathrm{rcos}(x) = \tfrac{1}{2}(1 + \cos \pi x)\, \mathrm{box}(x), \tag{3.61} $$

also known as the Hann window, as the windowing function. Wolberg (1990) and
Oppenheim, Schafer, and Buck (1999) discuss additional windowing functions, which
include the Lanczos window, the positive first lobe of a sinc function.

We can also compute the Fourier transforms for the small discrete kernels shown in
Figure 3.14 (see Table 3.3). Notice how the moving average filters do not uniformly dampen
higher frequencies and hence can lead to ringing artifacts. The binomial filter (Gomes and
Velho 1997) used as the “Gaussian” in Burt and Adelson’s (1983a) Laplacian pyramid (see
Section 3.5) does a decent job of separating the high and low frequencies, but still leaves
a fair amount of high-frequency detail, which can lead to aliasing after downsampling. The
Sobel edge detector at first linearly accentuates frequencies, but then decays at higher
frequencies, and hence has trouble detecting fine-scale edges, e.g., adjacent black and white
columns. We look at additional examples of small kernel Fourier transforms in Section 3.5.2,
where we study better kernels for pre-filtering before decimation (size reduction).
Name     Kernel     Transform     Plot
box-3
box-5


Table 3.3 Fourier transforms of the separable kernels shown in Figure 3.14. 
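The behavior described for these small kernels can be reproduced by evaluating their discrete-time Fourier transforms directly. This is a sketch under assumptions: the kernel coefficients below (a 3-tap box, the 1/16 [1 4 6 4 1] binomial, and a ½[−1 0 1] derivative standing in for the one-dimensional Sobel factor) are the usual textbook values rather than entries read from the table, and `frequency_response` is a hypothetical helper name.

```python
import numpy as np

def frequency_response(kernel, omega):
    """Evaluate H(w) = sum_k h[k] exp(-j w k) for an odd-length kernel centered at k = 0."""
    kernel = np.asarray(kernel, dtype=float)
    half = len(kernel) // 2
    k = np.arange(-half, half + 1)
    return np.exp(-1j * np.outer(omega, k)) @ kernel

omega = np.linspace(0.0, np.pi, 5)            # from DC up to the Nyquist frequency

box3 = np.ones(3) / 3.0                       # assumed box-3 coefficients
binomial = np.array([1, 4, 6, 4, 1]) / 16.0   # assumed binomial coefficients
sobel = np.array([-1, 0, 1]) / 2.0            # assumed 1D derivative factor of the Sobel kernel

H_box3 = frequency_response(box3, omega).real
H_binomial = frequency_response(binomial, omega).real
H_sobel = np.abs(frequency_response(sobel, omega))

# The box response (1 + 2 cos w) / 3 changes sign before Nyquist (non-uniform damping),
# the binomial response ((1 + cos w) / 2)^2 decays smoothly to zero, and the derivative
# magnitude |sin w| first rises and then falls back to zero at Nyquist.
print(H_box3, H_binomial, H_sobel, sep="\n")
```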
3.4.2 Two-dimensional Fourier transforms


The formulas and insights we have developed for one-dimensional signals and their
transforms translate directly to two-dimensional images. Here, instead of just specifying a
horizontal or vertical frequency $\omega_x$ or $\omega_y$, we can create an oriented sinusoid
of frequency $(\omega_x, \omega_y)$,

$$ s(x, y) = \sin(\omega_x x + \omega_y y). \tag{3.62} $$

The corresponding two-dimensional Fourier transforms are then

$$ H(\omega_x, \omega_y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x, y)\, e^{-j(\omega_x x + \omega_y y)}\, dx\, dy, \tag{3.63} $$

and in the discrete domain,

$$ H(k_x, k_y) = \frac{1}{MN} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} h(x, y)\, e^{-j 2\pi \left(\frac{k_x x}{M} + \frac{k_y y}{N}\right)}, \tag{3.64} $$

where $M$ and $N$ are the width and height of the image.

All of the Fourier transform properties from Table 3.1 carry over to two dimensions if
we replace the scalar variables $x$, $\omega$, $x_0$ and $a$ with their 2D vector counterparts
$\mathbf{x} = (x, y)$, $\boldsymbol{\omega} = (\omega_x, \omega_y)$, $\mathbf{x}_0 = (x_0, y_0)$,
and $\mathbf{a} = (a_x, a_y)$, and use vector inner products instead of multiplications.
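As a small illustration of the discrete two-dimensional transform, the sketch below builds an oriented sinusoid and locates its two spectral peaks. It assumes NumPy's `fft2` conventions (arrays indexed as `[y, x]`, unnormalized forward transform), so the 1/(MN) factor is applied by hand; the image size and frequency indices are arbitrary choices.

```python
import numpy as np

M, N = 64, 48                      # image width and height
kx, ky = 5, 3                      # integer frequency indices of the oriented sinusoid
X, Y = np.meshgrid(np.arange(M), np.arange(N))    # both of shape (N, M)

s = np.sin(2 * np.pi * (kx * X / M + ky * Y / N))
S = np.fft.fft2(s) / (M * N)       # normalized 2D DFT

# A real sinusoid produces a conjugate pair of peaks of magnitude 1/2,
# at (ky, kx) and at the mirrored position (N - ky, M - kx).
peaks = np.argwhere(np.abs(S) > 0.25)

# The 2D transform is separable: 1D FFTs along rows, then along columns.
S_separable = np.fft.fft(np.fft.fft(s, axis=1), axis=0) / (M * N)

print(peaks.tolist())
```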


3.4.3 Wiener filtering 

While the Fourier transform is a useful tool for analyzing the frequency characteristics of a 
filter kernel or image, it can also be used to analyze the frequency spectrum of a whole class 
of images. 

A simple model for images is to assume that they are random noise fields whose expected
magnitude at each frequency is given by the power spectrum $P_s(\omega_x, \omega_y)$, i.e.,

$$ \left\langle [S(\omega_x, \omega_y)]^2 \right\rangle = P_s(\omega_x, \omega_y), \tag{3.65} $$

where the angle brackets $\langle \cdot \rangle$ denote the expected (mean) value of a random
variable.⁹ To generate such an image, we simply create a random Gaussian noise image
$S(\omega_x, \omega_y)$ where each “pixel” is a zero-mean Gaussian¹⁰ of variance
$P_s(\omega_x, \omega_y)$ and then take its inverse FFT.

The observation that signal spectra capture a first-order description of spatial statistics
is widely used in signal and image processing. In particular, assuming that an image is a
sample from a correlated Gaussian random noise field combined with a statistical model of
the measurement process yields an optimum restoration filter known as the Wiener filter.¹¹

9 The notation $E[\cdot]$ is also commonly used.

10 We set the DC (i.e., constant) component at $S(0, 0)$ to the mean grey level. See Algorithm C.1 in
Appendix C.2 for code to generate Gaussian noise.

To derive the Wiener filter, we analyze each frequency component of a signal’s Fourier
transform independently. The noisy image formation process can be written as

$$ o(x, y) = s(x, y) + n(x, y), \tag{3.66} $$

where $s(x, y)$ is the (unknown) image we are trying to recover, $n(x, y)$ is the additive noise
signal, and $o(x, y)$ is the observed noisy image. Because of the linearity of the Fourier
transform, we can write

$$ O(\omega_x, \omega_y) = S(\omega_x, \omega_y) + N(\omega_x, \omega_y), \tag{3.67} $$

where each quantity in the above equation is the Fourier transform of the corresponding
image.

At each frequency $(\omega_x, \omega_y)$, we know from our image spectrum that the unknown
transform component $S(\omega_x, \omega_y)$ has a prior distribution which is a zero-mean
Gaussian with variance $P_s(\omega_x, \omega_y)$. We also have a noisy measurement
$O(\omega_x, \omega_y)$ whose variance is $P_n(\omega_x, \omega_y)$, i.e., the power spectrum of
the noise, which is usually assumed to be constant (white), $P_n(\omega_x, \omega_y) = \sigma_n^2$.

According to Bayes’ Rule (Appendix B.4), the posterior estimate of $S$ can be written as

$$ p(S|O) = \frac{p(O|S)\, p(S)}{p(O)}, \tag{3.68} $$

where $p(O) = \int_S p(O|S)\, p(S)$ is a normalizing constant used to make the $p(S|O)$
distribution proper (integrate to 1). The prior distribution $p(S)$ is given by

$$ p(S) = e^{-\frac{(S - \mu)^2}{2 P_s}}, \tag{3.69} $$

where $\mu$ is the expected mean at that frequency (0 everywhere except at the origin) and the
measurement distribution $p(O|S)$ is given by

$$ p(O|S) = e^{-\frac{(S - O)^2}{2 P_n}}. \tag{3.70} $$

Taking the negative logarithm of both sides of (3.68) and setting $\mu = 0$ for simplicity, we
get

$$ -\log p(S|O) = -\log p(O|S) - \log p(S) + C \tag{3.71} $$

$$ \phantom{-\log p(S|O)} = \tfrac{1}{2} P_n^{-1} (S - O)^2 + \tfrac{1}{2} P_s^{-1} S^2 + C, \tag{3.72} $$


11 Wiener is pronounced “veener” since, in German, the “w” is pronounced “v”. Remember that next time you
order “Wiener schnitzel”.

Figure 3.25 One-dimensional Wiener filter: (a) power spectrum of signal $P_s(f)$, noise level
$\sigma_n^2$, and Wiener filter transform $W(f)$; (b) Wiener filter spatial kernel.


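which is the negative posterior log likelihood. The minimum of this quantity is easy to
compute,

$$ S_{\mathrm{opt}} = \frac{P_n^{-1}}{P_n^{-1} + P_s^{-1}}\, O = \frac{P_s}{P_s + P_n}\, O = \frac{1}{1 + P_n / P_s}\, O. \tag{3.73} $$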
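For readers who want the intermediate step, the minimum follows from setting the derivative of the quadratic energy with respect to $S$ to zero. The lines below are a sketch of that calculation under the same $\mu = 0$ assumption; the constant $C$ drops out because it does not depend on $S$.

```latex
\frac{\partial}{\partial S}\left[ \tfrac{1}{2} P_n^{-1} (S - O)^2 + \tfrac{1}{2} P_s^{-1} S^2 \right]
  = P_n^{-1} (S - O) + P_s^{-1} S = 0
\quad\Longrightarrow\quad
\left( P_n^{-1} + P_s^{-1} \right) S = P_n^{-1}\, O
\quad\Longrightarrow\quad
S_{\mathrm{opt}} = \frac{P_n^{-1}}{P_n^{-1} + P_s^{-1}}\, O .
```

Multiplying the numerator and denominator by $P_n P_s$ gives the equivalent form $P_s / (P_s + P_n)$, and dividing both by $P_s$ gives $1 / (1 + P_n / P_s)$.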


The quantity

$$ W(\omega_x, \omega_y) = \frac{1}{1 + \sigma_n^2 / P_s(\omega_x, \omega_y)} \tag{3.74} $$

is the Fourier transform of the optimum Wiener filter needed to remove the noise from an
image whose power spectrum is $P_s(\omega_x, \omega_y)$.

Notice that this filter has the right qualitative properties, i.e., for low frequencies where
$P_s \gg \sigma_n^2$, it has unit gain, whereas for high frequencies, it attenuates the noise by
a factor $P_s / \sigma_n^2$. Figure 3.25 shows the one-dimensional transform $W(f)$ and the
corresponding filter kernel $w(x)$ for the commonly assumed case of $P_s(f) = f^{-2}$ (Field
1987). Exercise 3.16 has you compare the Wiener filter as a denoising algorithm to hand-tuned
Gaussian smoothing.
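The one-dimensional case can be sketched in a few lines. The example below is illustrative only: it assumes NumPy, measures frequency in cycles per sample, uses the $f^{-2}$ signal spectrum mentioned above without any calibration to real image data, and picks an arbitrary noise variance; `wiener_gain` and `wiener_denoise` are hypothetical helper names.

```python
import numpy as np

def wiener_gain(P_s, noise_var):
    """Per-frequency gain W = 1 / (1 + sigma_n^2 / P_s), as in (3.74)."""
    return 1.0 / (1.0 + noise_var / P_s)

def wiener_denoise(observed, P_s, noise_var):
    """Scale each Fourier component of the observation by the Wiener gain."""
    O = np.fft.fft(observed)
    return np.fft.ifft(wiener_gain(P_s, noise_var) * O).real

n = 256
freqs = np.fft.fftfreq(n)                  # cycles per sample; freqs[0] is DC
with np.errstate(divide="ignore"):
    P_s = np.abs(freqs) ** -2.0            # assumed 1/f^2 prior; infinite at DC, so W = 1 there
noise_var = 4.0                            # assumed white-noise variance sigma_n^2
W = wiener_gain(P_s, noise_var)

rng = np.random.default_rng(1)
signal = np.cumsum(rng.standard_normal(n))             # a random walk, roughly 1/f^2
observed = signal + np.sqrt(noise_var) * rng.standard_normal(n)
restored = wiener_denoise(observed, P_s, noise_var)
```

Because the gain is real and symmetric in frequency, the filtered result of a real signal is real up to rounding, which is why only the real part is kept.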

The methodology given above for deriving the Wiener filter can easily be extended to the
case where the observed image is a noisy blurred version of the original image,

$$ o(x, y) = b(x, y) * s(x, y) + n(x, y), \tag{3.75} $$

where $b(x, y)$ is the known blur kernel. Rather than deriving the corresponding Wiener
filter, we leave it as an exercise (Exercise 3.17), which also encourages you to compare your
de-blurring results with unsharp masking and naive inverse filtering. More sophisticated
algorithms for blur removal are discussed in Sections 3.7 and 10.3.


Discrete cosine transform 

The discrete cosine transform (DCT) is a variant of the Fourier transform particularly
well-suited to compressing images in a block-wise fashion. The one-dimensional DCT is
computed by taking the dot product of each $N$-wide block of pixels with a set of cosines of
different frequencies,

$$ F(k) = \sum_{i=0}^{N-1} \cos\left( \frac{\pi}{N} \left( i + \tfrac{1}{2} \right) k \right) f(i), \tag{3.76} $$

where $k$ is the coefficient (frequency) index, and the ½-pixel offset is used to make the
basis coefficients symmetric (Wallace 1991). Some of the discrete cosine basis functions are
shown in Figure 3.26. As you can see, the first basis function (the straight blue line) encodes
the average DC value in the block of pixels, while the second encodes a slightly curvy version
of the slope.

Figure 3.26 Discrete cosine transform (DCT) basis functions: The first DC (i.e., constant)
basis is the horizontal blue line, the second is the brown half-cycle waveform, etc. These
bases are widely used in image and video compression standards such as JPEG.

It turns out that the DCT is a good approximation to the optimal Karhunen–Loève
decomposition of natural image statistics over small patches, which can be obtained by
performing a principal component analysis (PCA) of images, as described in Section 14.2.1.
The KL-transform de-correlates the signal optimally (assuming the signal is described by its
spectrum) and thus, theoretically, leads to optimal compression.

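The two-dimensional version of the DCT is defined similarly,

$$ F(k, l) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \cos\left( \frac{\pi}{N} \left( i + \tfrac{1}{2} \right) k \right) \cos\left( \frac{\pi}{N} \left( j + \tfrac{1}{2} \right) l \right) f(i, j). \tag{3.77} $$

Like the 2D Fast Fourier Transform, the 2D DCT can be implemented separably, i.e., first
computing the DCT of each line in the block and then computing the DCT of each resulting
column. Like the FFT, each of the DCTs can also be computed in $O(N \log N)$ time.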
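The separable structure can be made concrete with a direct matrix implementation. This sketch assumes NumPy and the unnormalized cosine basis with a half-sample offset; it costs $O(N^2)$ per one-dimensional transform, so it illustrates the definition rather than the fast algorithm, and the function names are hypothetical.

```python
import numpy as np

def dct_matrix(N):
    """Rows are the cosine bases cos(pi / N * (i + 1/2) * k) for k = 0 .. N - 1."""
    i = np.arange(N)
    k = np.arange(N)[:, None]
    return np.cos(np.pi / N * (i + 0.5) * k)

def dct_1d(block):
    """Dot product of an N-sample block with each cosine basis."""
    block = np.asarray(block, dtype=float)
    return dct_matrix(len(block)) @ block

def dct_2d(block):
    """Separable 2D DCT of a square block: transform columns, then rows."""
    block = np.asarray(block, dtype=float)
    C = dct_matrix(block.shape[0])
    return C @ block @ C.T

C8 = dct_matrix(8)
flat = dct_2d(np.ones((8, 8)))     # a constant block has only a DC coefficient
ramp = dct_1d(np.arange(8.0))      # a ramp puts most non-DC energy into k = 1
```

The rows of `C8` are mutually orthogonal, which is what makes the coefficients a non-redundant description of the block.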

As we mentioned in Section 2.3.3, the DCT is widely used in today’s image and video
compression algorithms, although it is slowly being supplanted by wavelet algorithms
(Simoncelli and Adelson 1990b), as discussed in Section 3.5.4, and overlapped variants of the
DCT (Malvar 1990, 1998, 2000), which are used in the new JPEG XR standard.¹² These
newer algorithms suffer less from the blocking artifacts (visible edge-aligned discontinuities)
that result from the pixels in each block (typically 8 × 8) being transformed and quantized
independently. See Exercise 3.30 for ideas on how to remove blocking artifacts from
compressed JPEG images.

12 http://www.itu.int/rec/T-REC-T.832-200903-I/en.

3.4.4 Application: Sharpening, blur, and noise removal

Another common application of image processing is the enhancement of images through the
use of sharpening and noise removal operations, which require some kind of neighborhood
processing. Traditionally, these kinds of operation were performed using linear filtering (see
Sections 3.2 and 3.4.3). Today, it is more common to use non-linear filters (Section 3.3.1),
such as the weighted median or bilateral filter (3.34–3.37), anisotropic diffusion
(3.39–3.40), or non-local means (Buades, Coll, and Morel 2008). Variational methods
(Section 3.7.1), especially those using non-quadratic (robust) norms such as the $L_1$ norm
(which is called total variation), are also often used. Figure 3.19 shows some examples of
linear and non-linear filters being used to remove noise.

When measuring the effectiveness of image denoising algorithms, it is common to report
the results as a peak signal-to-noise ratio (PSNR) measurement (2.119), where $I(\mathbf{x})$
is the original (noise-free) image and $\hat{I}(\mathbf{x})$ is the image after denoising; this
is for the case where the noisy image has been synthetically generated, so that the clean
image is known. A better way to measure the quality is to use a perceptually based similarity
metric, such as the structural similarity (SSIM) index (Wang, Bovik, Sheikh et al. 2004; Wang,
Bovik, and Simoncelli 2005).
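A PSNR computation can be sketched as follows. The formula used here, ten times the base-10 logarithm of the squared peak value over the mean squared error, is the common definition and is assumed to match (2.119); the default peak value of 255 presumes 8-bit images, and `psnr` is a hypothetical helper name.

```python
import numpy as np

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between a clean image and its estimate."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0.0:
        return float("inf")        # identical images: no error to measure
    return 10.0 * np.log10(max_value**2 / mse)

clean = np.zeros((4, 4))
off_by_one = clean + 1.0           # every pixel wrong by one grey level
print(psnr(clean, off_by_one))     # equals 20 * log10(255)
```

Higher values indicate a closer match to the reference; the measure is only meaningful when the clean image is available, as in the synthetic-noise setting described above.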

Exercises 3.11, 3.16, 3.17, 3.21, and 3.28 have you implement some of these operations 
and compare their effectiveness. More sophisticated techniques for blur removal and the 
related task of super-resolution are discussed in Section 10.3. 

3.5 Pyramids and wavelets 

So far in this chapter, all of the image transformations we have studied produce output images 
of the same size as the inputs. Often, however, we may wish to change the resolution of an 
image before proceeding further. For example, we may need to interpolate a small image to 
make its resolution match that of the output printer or computer screen. Alternatively, we 
may want to reduce the size of an image to speed up the execution of an algorithm or to save 
on storage space or transmission time. 

Sometimes, we do not even know what the appropriate resolution for the image should 
be. Consider, for example, the task of finding a face in an image (Section 14.1.1). Since we 
do not know the scale at which the face will appear, we need to generate a whole pyramid 
of differently sized images and scan each one for possible faces. (Biological visual systems
also operate on a hierarchy of scales (Marr 1982).) Such a pyramid can also be very helpful 
in accelerating the search for an object by first finding a smaller instance of that object at a 
coarser level of the pyramid and then looking for the full resolution object only in the vicinity 
of coarse-level detections (Section 8.1.1). Finally, image pyramids are extremely useful for 
performing multi-scale editing operations such as blending images while maintaining details. 

In this section, we first discuss good filters for changing image resolution, i.e., upsampling
(interpolation, Section 3.5.1) and downsampling (decimation, Section 3.5.2). We then present
the concept of multi-resolution pyramids, which can be used to create a complete hierarchy 
of differently sized images and to enable a variety of applications (Section 3.5.3). A closely 
related concept is that of wavelets, which are a special kind of pyramid with higher frequency 
selectivity and other useful properties (Section 3.5.4). Finally, we present a useful application 
of pyramids, namely the blending of different images in a way that hides the seams between 
the image boundaries (Section 3.5.5). 

3.5.1 Interpolation 

In order to interpolate (or upsample) an image to a higher resolution, we need to select some
interpolation kernel with which to convolve the image,

$$ g(i, j) = \sum_{k,l} f(k, l)\, h(i - rk, j - rl). \tag{3.78} $$

This formula is related to the discrete convolution formula (3.14), except that we replace k 
and l in h() with rk and rl, where r is the upsampling rate. Figure 3.27a shows how to think 
of this process as the superposition of sample weighted interpolation kernels, one centered 
at each input sample k. An alternative mental model is shown in Figure 3.27b, where the 
kernel is centered at the output pixel value i (the two forms are equivalent). The latter form 
is sometimes called the polyphase filter form, since the kernel values h(i) can be stored as r 
separate kernels, each of which is selected for convolution with the input samples depending 
on the phase of i relative to the upsampled grid. 
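The superposition view of Figure 3.27a can be written out directly in one dimension. This is a minimal sketch, assuming NumPy, an integer rate `r`, and a tent kernel sampled on the output grid (which yields linear interpolation); the function name and the choice to produce outputs only between the first and last input samples are illustrative.

```python
import numpy as np

def upsample_linear(f, r):
    """g(i) = sum_k f(k) h(i - r k) with a tent kernel h that spans 2r - 1 output samples."""
    f = np.asarray(f, dtype=float)
    taps = np.arange(-(r - 1), r)
    h = 1.0 - np.abs(taps) / r                 # h(0) = 1 and h(+-r) = 0: an interpolating kernel
    g = np.zeros((len(f) - 1) * r + 1)
    for k, fk in enumerate(f):                 # one weighted kernel centered at each input sample
        for t, ht in zip(taps, h):
            i = r * k + t
            if 0 <= i < len(g):
                g[i] += fk * ht
    return g

print(upsample_linear([0.0, 2.0, 4.0], 2))     # samples of the line through the inputs
```

The kernel taps with the same value of `i mod r` always meet the same input offsets, which is the observation behind the polyphase form mentioned above.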

What kinds of kernel make good interpolators? The answer depends on the application
and the computation time involved. Any of the smoothing kernels shown in Tables 3.2 and 3.3
can be used after appropriate re-scaling.¹³ The linear interpolator (corresponding to the tent
kernel) produces interpolating piecewise linear curves, which result in unappealing creases
when applied to images (Figure 3.28a). The cubic B-spline, whose discrete ½-pixel sampling
appears as the binomial kernel in Table 3.3, is an approximating kernel (the interpolated
image does not pass through the input data points) that produces soft images with reduced
high-frequency detail. The equation for the cubic B-spline is easiest to derive by convolving
the tent function (linear B-spline) with itself.

13 The smoothing kernels in Table 3.3 have a unit area. To turn them into interpolating kernels, we simply scale
them up by the interpolation rate $r$.

Figure 3.27 Signal interpolation, $g(i) = \sum_k f(k)\, h(i - rk)$: (a) weighted summation of
input values; (b) polyphase filter interpretation.

While most graphics cards use the bilinear kernel (optionally combined with a MIP-map,
see Section 3.5.3), most photo editing packages use bicubic interpolation. The cubic
interpolant is a $C^1$ (derivative-continuous) piecewise-cubic spline (the term “spline” is
synonymous with “piecewise-polynomial”)¹⁴ whose equation is

$$ h(x) = \begin{cases} 1 - (a + 3)x^2 + (a + 2)|x|^3 & \text{if } |x| < 1 \\ a(|x| - 1)(|x| - 2)^2 & \text{if } 1 \le |x| < 2 \\ 0 & \text{otherwise,} \end{cases} \tag{3.79} $$

where $a$ specifies the derivative at $x = 1$ (Parker, Kenyon, and Troxel 1983). The value of
$a$ is often set to $-1$, since this best matches the frequency characteristics of a sinc function
(Figure 3.29). It also introduces a small amount of sharpening, which can be visually
appealing. Unfortunately, this choice does not linearly interpolate straight lines (intensity
ramps), so some visible ringing may occur. A better choice for large amounts of interpolation
is probably $a = -0.5$, which produces a quadratic reproducing spline; it interpolates linear
and quadratic functions exactly (Wolberg 1990, Section 5.4.3). Figure 3.29 shows the $a = -1$
and $a = -0.5$ cubic interpolating kernels along with their Fourier transforms; Figure 3.28b
and c shows them being applied to two-dimensional interpolation.
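The difference between the two parameter choices on an intensity ramp is easy to reproduce. The sketch below assumes NumPy and evaluates the piecewise-cubic kernel of (3.79) at the four nearest samples; `cubic_kernel` and `cubic_interpolate` are hypothetical helper names and boundary handling is omitted.

```python
import numpy as np

def cubic_kernel(x, a=-0.5):
    """Piecewise-cubic interpolation kernel with free parameter a."""
    x = np.abs(np.asarray(x, dtype=float))
    inner = 1.0 - (a + 3.0) * x**2 + (a + 2.0) * x**3
    outer = a * (x - 1.0) * (x - 2.0) ** 2
    return np.where(x < 1.0, inner, np.where(x < 2.0, outer, 0.0))

def cubic_interpolate(f, t, a=-0.5):
    """Interpolate samples f at fractional position t, valid for 1 <= t < len(f) - 2."""
    f = np.asarray(f, dtype=float)
    k0 = int(np.floor(t))
    ks = np.arange(k0 - 1, k0 + 3)             # the four samples within the kernel support
    return float(np.sum(f[ks] * cubic_kernel(t - ks, a)))

ramp = np.arange(6.0)                           # a straight intensity ramp
exact = cubic_interpolate(ramp, 2.25, a=-0.5)   # reproduces the ramp value 2.25
sharpened = cubic_interpolate(ramp, 2.25, a=-1.0)   # deviates from the ramp
print(exact, sharpened)
```

For both settings the kernel passes through 1 at the origin and 0 at the other integers, so the input samples themselves are always reproduced; the parameter only changes what happens in between.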

Splines have long been used for function and data value interpolation because of the
ability to precisely specify derivatives at control points and efficient incremental algorithms
for their evaluation (Bartels, Beatty, and Barsky 1987; Farin 1992, 1996). Splines are widely
used in geometric modeling and computer-aided design (CAD) applications, although they have
started being displaced by subdivision surfaces (Zorin, Schröder, and Sweldens 1996; Peters
and Reif 2008). In computer vision, splines are often used for elastic image deformations
(Section 3.6.2), motion estimation (Section 8.3), and surface interpolation (Section 12.3). In
fact, it is possible to carry out most image processing operations by representing images as
splines and manipulating them in a multi-resolution framework (Unser 1999).

14 The term “spline” comes from the draughtsman’s workshop, where it was the name of a flexible piece of wood
or metal used to draw smooth curves.

Figure 3.28 Two-dimensional image interpolation: (a) bilinear; (b) bicubic ($a = -1$); (c)
bicubic ($a = -0.5$); (d) windowed sinc (nine taps).

Figure 3.29 (a) Some windowed sinc functions and (b) their log Fourier transforms:
raised-cosine windowed sinc in blue, cubic interpolators ($a = -1$ and $a = -0.5$) in green and
purple, and tent function in brown. They are often used to perform high-accuracy low-pass
filtering operations.

The highest quality interpolator is generally believed to be the windowed sinc function
because it both preserves details in the lower resolution image and avoids aliasing. (It is also
possible to construct a $C^1$ piecewise-cubic approximation to the windowed sinc by matching
its derivatives at zero crossings (Szeliski and Ito 1986).) However, some people object to the
excessive ringing that can be introduced by the windowed sinc and to the repetitive nature
of the ringing frequencies (see Figure 3.28d). For this reason, some photographers prefer
to repeatedly interpolate images by a small fractional amount (this tends to de-correlate the
original pixel grid with the final image). Additional possibilities include using the bilateral
filter as an interpolator (Kopf, Cohen, Lischinski et al. 2007), using global optimization
(Section 3.6), or hallucinating details (Section 10.3).

3.5.2 Decimation 

While interpolation can be used to increase the resolution of an image, decimation
(downsampling) is required to reduce the resolution.¹⁵ To perform decimation, we first
(conceptually) convolve the image with a low-pass filter (to avoid aliasing) and then keep
every $r$th sample. In practice, we usually only evaluate the convolution at every $r$th sample,

$$ g(i, j) = \sum_{k,l} f(k, l)\, h(ri - k, rj - l), \tag{3.80} $$

as shown in Figure 3.30. Note that the smoothing kernel $h(k, l)$, in this case, is often a
stretched and re-scaled version of an interpolation kernel. Alternatively, we can write

$$ g(i, j) = \frac{1}{r} \sum_{k,l} f(k, l)\, h(i - k/r, j - l/r) \tag{3.81} $$

and keep the same kernel $h(k, l)$ for both interpolation and decimation.

One commonly used (r = 2) decimation filter is the binomial filter introduced by Burt 
and Adelson (1983a). As shown in Table 3.3, this kernel does a decent job of separating 
the high and low frequencies, but still leaves a fair amount of high-frequency detail, which 
can lead to aliasing after downsampling. However, for applications such as image blending 
(discussed later in this section), this aliasing is of little concern. 
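Evaluating the convolution only at the retained samples can be sketched in one dimension as follows. The example assumes NumPy, an odd-length kernel, and reflected borders (the border rule is a choice made here, not something prescribed above); `decimate` is a hypothetical helper name.

```python
import numpy as np

BINOMIAL = np.array([1, 4, 6, 4, 1]) / 16.0

def decimate(f, kernel=BINOMIAL, r=2):
    """Compute g(i) = sum_k f(k) h(r i - k), visiting only every r-th output position."""
    f = np.asarray(f, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    half = len(kernel) // 2
    padded = np.pad(f, half, mode="reflect")    # assumed border handling
    out = []
    for i in range(0, len(f), r):
        window = padded[i : i + len(kernel)]    # input samples f(i - half) .. f(i + half)
        out.append(np.dot(window, kernel[::-1]))
    return np.array(out)

ramp = np.arange(9.0)
print(decimate(ramp))       # interior values follow the ramp at every second sample
```

The result matches a full convolution followed by subsampling, but skips the work for samples that would be discarded.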

15 The term “decimation” has a gruesome etymology relating to the practice of killing every tenth soldier in
a Roman unit guilty of cowardice. It is generally used in signal processing to mean any downsampling or rate
reduction operation.

Figure 3.30 Signal decimation: (a) the original samples are (b) convolved with a low-pass
filter before being downsampled.


If, however, the downsampled images will be displayed directly to the user or, perhaps,
blended with other resolutions (as in MIP-mapping, Section 3.5.3), a higher-quality filter is
desired. For high downsampling rates, the windowed sinc pre-filter is a good choice
(Figure 3.29). However, for small downsampling rates, e.g., $r = 2$, more careful filter design
is required.

Table 3.4 shows a number of commonly used r = 2 downsampling filters, while Fig- 
ure 3.31 shows their corresponding frequency responses. These filters include: 

• the linear [1,2,1] filter gives a relatively poor response; 

• the binomial [1,4, 6, 4, 1] filter cuts off a lot of frequencies but is useful for computer 
vision analysis pyramids; 

• the cubic filters from (3.79); the $a = -1$ filter has a sharper fall-off than the $a = -0.5$
filter (Figure 3.31);


|n|   Linear   Binomial   Cubic      Cubic       Windowed   QMF-9     JPEG
                          a = -1     a = -0.5    sinc                 2000

0     0.50     0.3750     0.5000     0.50000     0.4939     0.5638    0.6029
1     0.25     0.2500     0.3125     0.28125     0.2684     0.2932    0.2669
2              0.0625     0.0000     0.00000     0.0000    -0.0519   -0.0782
3                        -0.0625    -0.03125    -0.0153    -0.0431   -0.0169
4                                                0.0000     0.0198    0.0267


Table 3.4 Filter coefficients for 2× decimation. These filters are of odd length, are
symmetric, and are normalized to have unit DC gain (sum up to 1). See Figure 3.31 for their
associated frequency responses.

Figure 3.31 Frequency response for some 2× decimation filters. The cubic $a = -1$ filter
has the sharpest fall-off but also a bit of ringing; the wavelet analysis filters (QMF-9 and
JPEG 2000), while useful for compression, have more aliasing.

• a cosine-windowed sinc function (Table 3.2);

• the QMF-9 filter of Simoncelli and Adelson (1990b) is used for wavelet denoising and
aliases a fair amount (note that the original filter coefficients are normalized to $\sqrt{2}$
gain so they can be “self-inverting”);

• the 9/7 analysis filter from JPEG 2000 (Taubman and Marcellin 2002). 

Please see the original papers for the full-precision values of some of these coefficients. 
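The frequency responses of these symmetric filters follow directly from their one-sided coefficients. The sketch below assumes NumPy and uses the rounded values transcribed from Table 3.4 (so the DC gains are only approximately 1); the dictionary and function names are illustrative.

```python
import numpy as np

# One-sided coefficients h[0], h[1], ... for |n| = 0, 1, ..., as listed in Table 3.4.
HALF_FILTERS = {
    "linear":        [0.50, 0.25],
    "binomial":      [0.3750, 0.2500, 0.0625],
    "cubic a=-1":    [0.5000, 0.3125, 0.0000, -0.0625],
    "cubic a=-0.5":  [0.50000, 0.28125, 0.00000, -0.03125],
    "windowed sinc": [0.4939, 0.2684, 0.0000, -0.0153, 0.0000],
    "QMF-9":         [0.5638, 0.2932, -0.0519, -0.0431, 0.0198],
    "JPEG 2000":     [0.6029, 0.2669, -0.0782, -0.0169, 0.0267],
}

def response(half, omega):
    """H(w) = h[0] + 2 * sum_{n >= 1} h[n] cos(n w) for a symmetric odd-length filter."""
    half = np.asarray(half, dtype=float)
    omega = np.atleast_1d(np.asarray(omega, dtype=float))
    n = np.arange(1, len(half))
    return half[0] + 2.0 * np.cos(np.outer(omega, n)) @ half[1:]

# Gain at DC, at pi/2 (the Nyquist frequency after 2x decimation), and at pi.
probe = np.array([0.0, np.pi / 2, np.pi])
gains = {name: response(h, probe) for name, h in HALF_FILTERS.items()}
for name, g in gains.items():
    print(f"{name:14s}", np.round(g, 4))
```

Energy that a filter passes above π/2 folds back into the decimated signal as aliasing, so the second and third columns give a rough indication of how much each filter would alias.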


3.5.3 Multi-resolution representations

Now that we have described interpolation and decimation algorithms, we can build a complete 
image pyramid (Figure 3.32). As we mentioned before, pyramids can be used to accelerate 
coarse-to-fine search algorithms, to look for objects or patterns at different scales, and to per- 
form multi-resolution blending operations. They are also widely used in computer graphics 
hardware and software to perform fractional-level decimation using the MIP-map, which we 
cover in Section 3.6. 

The best known (and probably most widely used) pyramid in computer vision is Burt
and Adelson’s (1983a) Laplacian pyramid. To construct the pyramid, we first blur and
subsample the original image by a factor of two and store this in the next level of the pyramid
(Figure 3.33). Because adjacent levels in the pyramid are related by a sampling rate $r = 2$,
this kind of pyramid is known as an octave pyramid. Burt and Adelson originally proposed a
five-tap kernel of the form

$$ [\; c \;\; b \;\; a \;\; b \;\; c \;], \tag{3.82} $$

with $b = 1/4$ and $c = 1/4 - a/2$. In practice, $a = 3/8$, which results in the familiar binomial
kernel,

$$ \frac{1}{16} [\; 1 \;\; 4 \;\; 6 \;\; 4 \;\; 1 \;], \tag{3.83} $$

which is particularly easy to implement using shifts and adds. (This was important in the days
when multipliers were expensive.) The reason they call their resulting pyramid a Gaussian
pyramid is that repeated convolutions of the binomial kernel converge to a Gaussian.¹⁶

Figure 3.32 A traditional image pyramid: each level has half the resolution (width and
height), and hence a quarter of the pixels, of its parent level.

To compute the Laplacian pyramid, Burt and Adelson first interpolate a lower resolu- 
tion image to obtain a reconstructed low-pass version of the original image (Figure 3.34b). 
They then subtract this low-pass version from the original to yield the band-pass “Laplacian” 
image, which can be stored away for further processing. The resulting pyramid has perfect
reconstruction, i.e., the Laplacian images plus the base-level Gaussian ($L_2$ in Figure 3.34b)
are sufficient to exactly reconstruct the original image. Figure 3.33 shows the same
computation in one dimension as a signal processing diagram, which completely captures the
computations being performed during the analysis and re-synthesis stages.
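The analysis and re-synthesis stages can be sketched for a one-dimensional signal. This example assumes NumPy, the binomial kernel, reflected borders, and an interpolation step that inserts zeros and filters with twice the analysis coefficients; the function names are hypothetical and no quantization step is included.

```python
import numpy as np

KERNEL = np.array([1, 4, 6, 4, 1]) / 16.0

def blur(f, gain=1.0):
    """Binomial low-pass filter with reflected borders (assumed border rule)."""
    return gain * np.convolve(np.pad(f, 2, mode="reflect"), KERNEL, mode="valid")

def reduce_level(f):
    """Blur, then keep every second sample."""
    return blur(f)[::2]

def expand_level(f, n):
    """Insert zeros between samples, then filter with a doubled kernel."""
    up = np.zeros(n)
    up[::2] = f
    return blur(up, gain=2.0)

def build_laplacian_pyramid(f, levels):
    pyramid, current = [], np.asarray(f, dtype=float)
    for _ in range(levels):
        coarse = reduce_level(current)
        pyramid.append(current - expand_level(coarse, len(current)))   # band-pass level
        current = coarse
    pyramid.append(current)             # coarsest low-pass (Gaussian) level
    return pyramid

def reconstruct(pyramid):
    current = pyramid[-1]
    for band in reversed(pyramid[:-1]):
        current = band + expand_level(current, len(band))
    return current

signal = np.random.default_rng(4).standard_normal(64)
pyramid = build_laplacian_pyramid(signal, levels=3)
restored = reconstruct(pyramid)
```

Reconstruction is exact up to rounding regardless of the kernel, because each band-pass level stores precisely what the interpolated coarser level fails to explain.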

Burt and Adelson also describe a variant on the Laplacian pyramid, where the low-pass
image is taken from the original blurred image rather than the reconstructed pyramid (piping
the output of the L box directly to the subtraction in Figure 3.34b). This variant has less
aliasing, since it avoids one downsampling and upsampling round-trip, but it is not
self-inverting, since the Laplacian images are no longer adequate to reproduce the original
image.

16 Then again, this is true for any smoothing kernel (Wells 1986).

Figure 3.33 The Gaussian pyramid shown as a signal processing diagram: The (a) analysis
and (b) re-synthesis stages are shown as using similar computations. The white circles
indicate zero values inserted by the ↑2 upsampling operation. Notice how the reconstruction
filter coefficients are twice the analysis coefficients. The computation is shown as flowing
down the page, regardless of whether we are going from coarse to fine or vice versa.

As with the Gaussian pyramid, the term Laplacian is a bit of a misnomer, since their
band-pass images are really differences of (approximate) Gaussians, or DoGs,

$$ \mathrm{DoG}\{I; \sigma_1, \sigma_2\} = G_{\sigma_1} * I - G_{\sigma_2} * I = (G_{\sigma_1} - G_{\sigma_2}) * I. \tag{3.84} $$

A Laplacian of Gaussian (which we saw in (3.26)) is actually its second derivative,

$$ \mathrm{LoG}\{I; \sigma\} = \nabla^2 (G_\sigma * I) = (\nabla^2 G_\sigma) * I, \tag{3.85} $$

where

$$ \nabla^2 = \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2} \tag{3.86} $$

is the Laplacian (operator) of a function. Figure 3.35 shows how the Differences of Gaussian
and Laplacians of Gaussian look in both space and frequency.
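How closely a difference of Gaussians tracks a Laplacian of Gaussian can be checked numerically. The sketch below is one-dimensional and assumes NumPy; it relies on the identity that the derivative of a unit-area Gaussian with respect to its width equals the width times its second spatial derivative, and the two widths (1.0 and 1.1) are arbitrary nearby values.

```python
import numpy as np

def gaussian(x, sigma):
    """Unit-area Gaussian of width sigma."""
    return np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def log_kernel(x, sigma):
    """Second derivative of the Gaussian (the 1D Laplacian of Gaussian)."""
    return (x**2 / sigma**4 - 1 / sigma**2) * gaussian(x, sigma)

x = np.linspace(-8.0, 8.0, 801)
s1, s2 = 1.0, 1.1                          # two nearby widths, s1 < s2
dog = gaussian(x, s1) - gaussian(x, s2)    # narrow minus wide, as in (3.84)

# First-order approximation around the mid-point width: G_s1 - G_s2 ~ -(s2 - s1) * s * G''.
s_mid = 0.5 * (s1 + s2)
scaled_log = -(s2 - s1) * s_mid * log_kernel(x, s_mid)

correlation = np.corrcoef(dog, scaled_log)[0, 1]
max_rel_error = np.max(np.abs(dog - scaled_log)) / np.max(np.abs(dog))
print(correlation, max_rel_error)
```

The sign flip reflects the inverted “Mexican hat” orientation: the difference of a narrow and a wide Gaussian is positive at the center, whereas the second derivative is negative there.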

Laplacians of Gaussian have elegant mathematical properties, which have been widely 
studied in the scale-space community (Witkin 1983; Witkin, Terzopoulos, and Kass 1986; 
Lindeberg 1990; Nielsen, Florack, and Deriche 1997) and can be used for a variety of appli- 
cations including edge detection (Marr and Hildreth 1980; Perona and Malik 1990b), stereo 
matching (Witkin, Terzopoulos, and Kass 1987), and image enhancement (Nielsen, Florack, 
and Deriche 1997). 

A less widely used variant is half-octave pyramids, shown in Figure 3.36a. These were
first introduced to the vision community by Crowley and Stern (1984), who call them
Difference of Low-Pass (DOLP) transforms. Because of the small scale change between adjacent
levels, the authors claim that coarse-to-fine algorithms perform better. In the
image-processing community, half-octave pyramids combined with checkerboard sampling grids
are known as quincunx sampling (Feilner, Van De Ville, and Unser 2005). In detecting
multi-scale features (Section 4.1.1), it is often common to use half-octave or even
quarter-octave pyramids (Lowe 2004; Triggs 2004). However, in this case, the subsampling only
occurs at every octave level, i.e., the image is repeatedly blurred with wider Gaussians until
a full octave of resolution change has been achieved (Figure 4.11).

Figure 3.34 The Laplacian pyramid: (a) The conceptual flow of images through processing
stages: images are high-pass and low-pass filtered, and the low-pass filtered images are
processed in the next stage of the pyramid. During reconstruction, the interpolated image and
the (optionally filtered) high-pass image are added back together. The Q box indicates
quantization or some other pyramid processing, e.g., noise removal by coring (setting small
wavelet values to 0). (b) The actual computation of the high-pass filter involves first
interpolating the downsampled low-pass image and then subtracting it. This results in perfect
reconstruction when Q is the identity. The high-pass (or band-pass) images are typically
called Laplacian images, while the low-pass images are called Gaussian images.

Figure 3.35 The difference of two low-pass filters results in a band-pass filter. The dashed
blue lines show the close fit to a half-octave Laplacian of Gaussian.


3.5.4 Wavelets 

While pyramids are used extensively in computer vision applications, some people use wavelet 
decompositions as an alternative. Wavelets are filters that localize a signal in both space 
and frequency (like the Gabor filter in Table 3.2) and are defined over a hierarchy of scales. 
Wavelets provide a smooth way to decompose a signal into frequency components without 
blocking and are closely related to pyramids. 

Wavelets were originally developed in the applied math and signal processing communi- 
ties and were introduced to the computer vision community by Mallat (1989). Strang (1989); 
Simoncelli and Adelson (1990b); Rioul and Vetterli (1991); Chui (1992); Meyer (1993) all 
provide nice introductions to the subject along with historical reviews, while Chui (1992) pro- 
vides a more comprehensive review and survey of applications. Sweldens (1997) describes 
the more recent lifting approach to wavelets that we discuss shortly. 

Wavelets are widely used in the computer graphics community to perform multi-resolution 
geometric processing (Stollnitz, DeRose, and Salesin 1996) and have also been used in com- 
puter vision for similar applications (Szeliski 1990b; Pentland 1994; Gortler and Cohen 1995; 
Yaou and Chang 1994; Lai and Vemuri 1997; Szeliski 2006b), as well as for multi-scale oriented
filtering (Simoncelli, Freeman, Adelson et al. 1992) and denoising (Portilla, Strela,
Wainwright et al. 2003).

Figure 3.36 Multiresolution pyramids: (a) pyramid with half-octave (quincunx) sampling
(odd levels are colored gray for clarity); (b) wavelet pyramid: each wavelet level stores 3/4
of the original pixels (usually the horizontal, vertical, and mixed gradients), so that the total
number of wavelet coefficients and original pixels is the same.

Since both image pyramids and wavelets decompose an image into multi-resolution de- 
scriptions that are localized in both space and frequency, how do they differ? The usual 
answer is that traditional pyramids are overcomplete, i.e., they use more pixels than the original
image to represent the decomposition, whereas wavelets provide a tight frame, i.e., they
keep the size of the decomposition the same as the image (Figure 3.36b). However, some
wavelet families are, in fact, overcomplete in order to provide better shiftability or steering in
orientation (Simoncelli, Freeman, Adelson et al. 1992). A better distinction, therefore, might 
be that wavelets are more orientation selective than regular band-pass pyramids. 

How are two-dimensional wavelets constructed? Figure 3.37a shows a high-level dia- 
gram of one stage of the (recursive) coarse-to-fine construction (analysis) pipeline alongside 
the complementary re-construction (synthesis) stage. In this diagram, the high-pass filter 
followed by decimation keeps 3/4 of the original pixels, while 1/4 of the low-frequency
coefficients are passed on to the next stage for further analysis. In practice, the filtering is usually
broken down into two separable sub-stages, as shown in Figure 3.37b. The resulting three
wavelet images are sometimes called the high-high (HH), high-low (HL), and low-high
(LH) images. The high-low and low-high images accentuate the horizontal and vertical
edges and gradients, while the high-high image contains the less frequently occurring mixed 
derivatives. 
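The separable construction can be illustrated with a minimal sketch that uses the Haar filters (pairwise average and half-difference) as a stand-in for the H and L boxes. The Haar choice, the function names, and the band naming (first letter for the horizontal filter, second for the vertical one) are assumptions made for this example, not something prescribed by Figure 3.37.

```python
def haar_1d(seq):
    """Split a sequence into pairwise averages (low-pass) and half-differences (high-pass)."""
    low = [(seq[i] + seq[i + 1]) / 2 for i in range(0, len(seq), 2)]
    high = [(seq[i] - seq[i + 1]) / 2 for i in range(0, len(seq), 2)]
    return low, high


def transpose(img):
    return [list(col) for col in zip(*img)]


def filter_columns(img):
    """Apply haar_1d down every column; return (low, high) images in row-major order."""
    lows, highs = zip(*(haar_1d(col) for col in transpose(img)))
    return transpose(lows), transpose(highs)


def haar_2d(img):
    """One level of a separable 2D Haar decomposition (even width and height assumed)."""
    # Horizontal pass: filter and decimate every row.
    row_low, row_high = zip(*(haar_1d(row) for row in img))
    # Vertical pass on each half. Assumed naming: first letter = horizontal filter.
    ll, lh = filter_columns(row_low)
    hl, hh = filter_columns(row_high)
    return ll, hl, lh, hh


image = [[0, 8, 8, 8] for _ in range(4)]  # vertical edge between columns 0 and 1
ll, hl, lh, hh = haar_2d(image)
```

For this input, only the low-pass band and the band that was high-pass filtered horizontally are non-zero, and the four 2 × 2 bands together hold exactly as many coefficients as the 4 × 4 input has pixels.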

How are the high-pass H and low-pass L filters shown in Figure 3.37b chosen and how
can the corresponding reconstruction filters I and F be computed? Can filters be designed
that all have finite impulse responses? This topic has been the main subject of study in the
wavelet community for over two decades. The answer depends largely on the intended
application, e.g., whether the wavelets are being used for compression, image analysis (feature
finding), or denoising. Simoncelli and Adelson (1990b) show (in Table 4.1) some good
odd-length quadrature mirror filter (QMF) coefficients that seem to work well in practice.

Figure 3.37 Two-dimensional wavelet decomposition: (a) high-level diagram showing the
low-pass and high-pass transforms as single boxes; (b) separable implementation, which
involves first performing the wavelet transform horizontally and then vertically. The I and F
boxes are the interpolation and filtering boxes required to re-synthesize the image from its
wavelet components.

Since the design of wavelet filters is such a tricky art, is there perhaps a better way? In- 
deed, a simpler procedure is to split the signal into its even and odd components and then 
perform trivially reversible filtering operations on each sequence to produce what are called 
lifted wavelets (Figures 3.38 and 3.39). Sweldens (1996) gives a wonderfully understandable 
introduction to the lifting scheme for second-generation wavelets, followed by a comprehensive
review (Sweldens 1997).

As Figure 3.38 demonstrates, rather than first filtering the whole input sequence (image)
with high-pass and low-pass filters and then keeping the odd and even sub-sequences, the
lifting scheme first splits the sequence into its even and odd sub-components. Filtering the
even sequence with a low-pass filter L and subtracting the result from the odd sequence
is trivially reversible: simply perform the same filtering and then add the result back in.
Furthermore, this operation can be performed in place, resulting in significant space savings.
The same applies to filtering the resulting high-pass (difference) sequence with the correction
filter C and adding the result to the even sequence, which is used to ensure that the even
sequence is low-pass. A series of such lifting steps can be used to create
more complex filter responses with low computational cost and guaranteed reversibility.

Figure 3.38 One-dimensional wavelet transform: (a) usual high-pass + low-pass filters
followed by odd ($\downarrow 2_o$) and even ($\downarrow 2_e$) downsampling; (b) lifted version, which first selects the
odd and even subsequences and then applies a low-pass prediction stage L and a high-pass
correction stage C in an easily reversible manner.

This process can perhaps be more easily understood by considering the signal processing 
diagram in Figure 3.39. During analysis, the average of the even values is subtracted from the 
odd value to obtain a high-pass wavelet coefficient. However, the even samples still contain 
an aliased sample of the low-frequency signal. To compensate for this, a small amount of the 
high-pass wavelet is added back to the even sequence so that it is properly low-pass filtered. 
(It is easy to show that the effective low-pass filter is $[-1/8, 1/4, 3/4, 1/4, -1/8]$, which is
indeed a low-pass filter.) During synthesis, the same operations are reversed with a judicious
change in sign.

Figure 3.39 Lifted transform shown as a signal processing diagram: (a) The analysis stage
first predicts the odd value from its even neighbors, stores the difference wavelet, and then
compensates the coarser even value by adding in a fraction of the wavelet. (b) The synthesis
stage simply reverses the flow of computation and the signs of some of the filters and
operations. The light blue lines show what happens if we use four taps for the prediction and
correction instead of just two.
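The two-tap lifting steps described above can be sketched in a few lines. This is a minimal illustration, assuming an even-length signal and periodic (wrap-around) boundary handling; the function names are hypothetical, and the 1/2 and 1/4 weights are the two-tap prediction and correction weights implied by the effective low-pass filter quoted in the text.

```python
def lift_analysis(x):
    """One level of the two-tap lifted wavelet transform (periodic boundaries assumed).

    Returns (low, high): the corrected even samples and the wavelet coefficients.
    """
    assert len(x) % 2 == 0, "this sketch assumes an even-length signal"
    even, odd = x[0::2], x[1::2]
    n = len(even)
    # Predict: subtract the average of the two even neighbors from each odd sample.
    high = [odd[i] - 0.5 * (even[i] + even[(i + 1) % n]) for i in range(n)]
    # Correct: add a quarter of the two adjacent wavelets back to each even sample
    # (high[i - 1] wraps around to the last wavelet when i == 0).
    low = [even[i] + 0.25 * (high[i - 1] + high[i]) for i in range(n)]
    return low, high


def lift_synthesis(low, high):
    """Invert lift_analysis by running the same steps backwards with flipped signs."""
    n = len(low)
    even = [low[i] - 0.25 * (high[i - 1] + high[i]) for i in range(n)]
    odd = [high[i] + 0.5 * (even[i] + even[(i + 1) % n]) for i in range(n)]
    x = [0.0] * (2 * n)
    x[0::2], x[1::2] = even, odd
    return x


signal = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
low, high = lift_analysis(signal)
restored = lift_synthesis(low, high)  # matches `signal` up to rounding
```

Feeding unit impulses through `lift_analysis` and reading off one low-pass output recovers the five effective filter taps, which is a convenient way to check the weights.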

Of course, we need not restrict ourselves to two-tap filters. Figure 3.39 shows as light 
blue arrows additional filter coefficients that could optionally be added to the lifting scheme 
without affecting its reversibility. In fact, the low-pass and high-pass filtering operations can 
be interchanged, e.g., we could use a five-tap cubic low-pass filter on the odd sequence (plus 
center value) first, followed by a four-tap cubic low-pass predictor to estimate the wavelet, 
although I have not seen this scheme written down. 

Lifted wavelets are called second-generation wavelets because they can easily adapt to 
non-regular sampling topologies, e.g., those that arise in computer graphics applications such 
as multi-resolution surface manipulation (Schröder and Sweldens 1995). It also turns out that
lifted weighted wavelets, i.e., wavelets whose coefficients adapt to the underlying problem
being solved (Fattal 2009), can be extremely effective for low-level image manipulation tasks 
and also for preconditioning the kinds of sparse linear systems that arise in the optimization- 
based approaches to vision algorithms that we discuss in Section 3.7 (Szeliski 2006b). 

An alternative to the widely used “separable” approach to wavelet construction, which de- 
composes each level into horizontal, vertical, and “cross” sub-bands, is to use a representation 
that is more rotationally symmetric and orientationally selective and also avoids the aliasing 
inherent in sampling signals below their Nyquist frequency. 17 Simoncelli, Freeman, Adelson
et al. (1992) introduce such a representation, which they call a pyramidal radial frequency
implementation of shiftable multi-scale transforms or, more succinctly, steerable pyramids.
Their representation is not only overcomplete (which eliminates the aliasing problem) but is
also orientationally selective and has identical analysis and synthesis basis functions, i.e., it is
self-inverting, just like “regular” wavelets. This makes steerable pyramids a much
more useful basis for the structural analysis and matching tasks commonly used in computer
vision.

17 Such aliasing can often be seen as the signal content moving between bands as the original signal is slowly
shifted.

Figure 3.40 Steerable shiftable multiscale transforms (Simoncelli, Freeman, Adelson et al.
1992) © 1992 IEEE: (a) radial multi-scale frequency domain decomposition; (b) original
image; (c) a set of four steerable filters; (d) the radial multi-scale wavelet decomposition.

Figure 3.40a shows how such a decomposition looks in frequency space. Instead of
recursively dividing the frequency domain into 2 × 2 squares, which results in checkerboard
high frequencies, radial arcs are used. Figure 3.40b illustrates the resulting pyramid
sub-bands. Even though the representation is overcomplete, i.e., there are more wavelet
coefficients than input pixels, the additional frequency and orientation selectivity makes this
representation preferable for tasks such as texture analysis and synthesis (Portilla and
Simoncelli 2000) and image denoising (Portilla, Strela, Wainwright et al. 2003; Lyu and Simoncelli
2009).

Figure 3.41 Laplacian pyramid blending (Burt and Adelson 1983b) © 1983 ACM: (a) original
image of apple, (b) original image of orange, (c) regular splice, (d) pyramid blend.


3.5.5 Application: Image blending

One of the most engaging and fun applications of the Laplacian pyramid presented in Sec- 
tion 3.5.3 is the creation of blended composite images, as shown in Figure 3.41 (Burt and 
Adelson 1983b). While splicing the apple and orange images together along the midline 
produces a noticeable cut, splining them together (as Burt and Adelson (1983b) called their 
procedure) creates a beautiful illusion of a truly hybrid fruit. The key to their approach is 
that the low-frequency color variations between the red apple and the orange are smoothly 
blended, while the higher-frequency textures on each fruit are blended more quickly to avoid 
“ghosting” effects when two textures are overlaid. 

To create the blended image, each source image is first decomposed into its own Lapla- 
cian pyramid (Figure 3.42, left and middle columns). Each band is then multiplied by a 
smooth weighting function whose extent is proportional to the pyramid level. The simplest 
and most general way to create these weights is to take a binary mask image (Figure 3.43c) 
and to construct a Gaussian pyramid from this mask. Each Laplacian pyramid image is then
multiplied by its corresponding Gaussian mask and the sum of these two weighted pyramids
is then used to construct the final image (Figure 3.42, right column).

Figure 3.42 Laplacian pyramid blending details (Burt and Adelson 1983b) © 1983 ACM.
The first three rows show the high, medium, and low frequency parts of the Laplacian pyramid
(taken from levels 0, 2, and 4). The left and middle columns show the original apple and
orange images weighted by the smooth interpolation functions, while the right column shows
the averaged contributions.

Figure 3.43 Laplacian pyramid blend of two images of arbitrary shape (Burt and Adelson
1983b) © 1983 ACM: (a) first input image; (b) second input image; (c) region mask; (d)
blended image.
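The blending recipe can be sketched on one-dimensional signals, which keeps the code short while following the same steps: build a Laplacian pyramid for each source, a Gaussian pyramid for the mask, blend band by band, and collapse. The 1D simplification, the clamped borders, the linear-interpolation upsampler, and all function names are assumptions of this sketch rather than details taken from Burt and Adelson.

```python
KERNEL = [1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16]  # five-tap binomial low-pass


def blur(x):
    n = len(x)  # borders are handled by clamping the sample index
    return [sum(w * x[min(max(i + k - 2, 0), n - 1)] for k, w in enumerate(KERNEL))
            for i in range(n)]


def downsample(x):
    return blur(x)[::2]


def upsample(x, n):
    """Linearly interpolate x back to length n (undoes the [::2] decimation)."""
    out = []
    for i in range(n):
        j = i // 2
        nxt = min(j + 1, len(x) - 1)
        out.append(x[j] if i % 2 == 0 else 0.5 * (x[j] + x[nxt]))
    return out


def gaussian_pyramid(x, levels):
    pyr = [x]
    for _ in range(levels - 1):
        pyr.append(downsample(pyr[-1]))
    return pyr


def laplacian_pyramid(x, levels):
    g = gaussian_pyramid(x, levels)
    bands = [[a - b for a, b in zip(g[i], upsample(g[i + 1], len(g[i])))]
             for i in range(levels - 1)]
    return bands + [g[-1]]  # the coarsest level keeps the low-pass residual


def collapse(lap):
    x = lap[-1]
    for band in reversed(lap[:-1]):
        x = [a + b for a, b in zip(band, upsample(x, len(band)))]
    return x


def pyramid_blend(a, b, mask, levels=4):
    la, lb = laplacian_pyramid(a, levels), laplacian_pyramid(b, levels)
    gm = gaussian_pyramid(mask, levels)  # progressively smoother weights
    blended = [[m * p + (1 - m) * q for p, q, m in zip(pa, pb, pm)]
               for pa, pb, pm in zip(la, lb, gm)]
    return collapse(blended)


step_mask = [1.0] * 16 + [0.0] * 16
result = pyramid_blend([10.0] * 32, [0.0] * 32, step_mask)
```

With two constant signals, all of the band-pass levels are zero, so `result` is just the coarsest, heavily smoothed mask scaled by the signal value: a wide, gradual transition instead of a hard step. With an all-ones mask, the function returns the first signal unchanged, which reflects the perfect-reconstruction property of the pyramid.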

Figure 3.43 shows that this process can be applied to arbitrary mask images with sur- 
prising results. It is also straightforward to extend the pyramid blend to an arbitrary number 
of images whose pixel provenance is indicated by an integer-valued label image (see Exer- 
cise 3.20). This is particularly useful in image stitching and compositing applications, where 
the exposures may vary between different images, as described in Section 9.3.4. 


3.6 Geometric transformations 

In the previous sections, we saw how interpolation and decimation could be used to change 
the resolution of an image. In this section, we look at how to perform more general transfor- 
mations, such as image rotations or general warps. In contrast to the point processes we saw 
in Section 3.1, where the function applied to an image transforms the range of the image,

$$g(x) = h(f(x)), \tag{3.87}$$

here we look at functions that transform the domain,

$$g(x) = f(h(x)) \tag{3.88}$$

(see Figure 3.44).

Figure 3.44 Image warping involves modifying the domain of an image function rather than
its range.

Figure 3.45 Basic set of 2D geometric image transformations.
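The distinction between (3.87) and (3.88) is easy to see on a tiny one-dimensional "image". In this sketch, `brighten` and `mirror` are hypothetical stand-ins for h: the first acts on pixel values, the second on pixel coordinates.

```python
f = [10, 20, 30, 40]  # a four-pixel image


def brighten(value):
    return value + 5  # point process: changes the range (pixel values)


def mirror(i):
    return len(f) - 1 - i  # warp: changes the domain (pixel coordinates)


g_range = [brighten(f[i]) for i in range(len(f))]   # g(x) = h(f(x))
g_domain = [f[mirror(i)] for i in range(len(f))]    # g(x) = f(h(x))
```

The first result keeps every pixel in place and changes its value; the second keeps the set of values and changes where they appear.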

We begin by studying the global parametric 2D transformation first introduced in Sec- 
tion 2.1.2. (Such a transformation is called parametric because it is controlled by a small 
number of parameters.) We then turn our attention to more local general deformations such as 
those defined on meshes (Section 3.6.2). Finally, we show how image warps can be combined 
with cross-dissolves to create interesting morphs (in-between animations) in Section 3.6.3. 
For readers interested in more details on these topics, there is an excellent survey by Heck- 
bert (1986) as well as very accessible textbooks by Wolberg (1990), Gomes, Darsa, Costa 
et al. (1999) and Akenine-Moller and Haines (2002). Note that Heckbert’s survey is on tex- 
ture mapping, which is how the computer graphics community refers to the topic of warping 
images onto surfaces. 

3.6.1 Parametric transformations 

Parametric transformations apply a global deformation to an image, where the behavior of the 
transformation is controlled by a small number of parameters. Figure 3.45 shows a few
examples of such transformations, which are based on the 2D geometric transformations shown
in Figure 2.4. The formulas for these transformations were originally given in Table 2.1 and
are reproduced here in Table 3.5 for ease of reference.

Transformation       Matrix            # DoF   Preserves
translation          [I | t]  (2×3)    2       orientation
rigid (Euclidean)    [R | t]  (2×3)    3       lengths
similarity           [sR | t] (2×3)    4       angles
affine               [A]      (2×3)    6       parallelism
projective           [H̃]      (3×3)    8       straight lines

Table 3.5 Hierarchy of 2D coordinate transformations. Each transformation also preserves
the properties listed in the rows below it, i.e., similarity preserves not only angles but also
parallelism and straight lines. The 2 × 3 matrices are extended with a third [0^T 1] row to form
a full 3 × 3 matrix for homogeneous coordinate transformations.
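As a quick illustration of the hierarchy, the sketch below builds the rigid and similarity transforms as full 3 × 3 homogeneous matrices and applies them to points. The helper names are hypothetical; the matrix layouts follow the [R | t] and [sR | t] forms with the extra [0^T 1] row.

```python
import math


def rigid(theta, tx, ty):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]]


def similarity(scale, theta, tx, ty):
    c, s = scale * math.cos(theta), scale * math.sin(theta)
    return [[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]]


def apply(H, p):
    """Apply a 3x3 homogeneous transform to a 2D point and de-homogenize."""
    x, y = p
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)


p, q = (1.0, 2.0), (4.0, 6.0)  # two points that are 5 units apart
R = rigid(math.pi / 6, 3.0, -2.0)
S = similarity(2.0, math.pi / 6, 3.0, -2.0)
rigid_length = math.dist(apply(R, p), apply(R, q))        # lengths are preserved
similar_length = math.dist(apply(S, p), apply(S, q))      # lengths scale by s = 2
```

The same `apply` function works unchanged for affine and projective matrices, since the division by w is a no-op whenever the last row is [0 0 1].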

In general, given a transformation specified by a formula x' = h(x) and a source image 
f(x), how do we compute the values of the pixels in the new image g(x), as given in (3.88)? 
Think about this for a minute before proceeding and see if you can figure it out. 

If you are like most people, you will come up with an algorithm that looks something like 
Algorithm 3.1. This process is called forward warping or forward mapping and is shown in 
Figure 3.46a. Can you think of any problems with this approach? 


procedure forwardWarp(f, h, out g):

For every pixel x in f(x)

1. Compute the destination location x' = h(x).

2. Copy the pixel f(x) to g(x').


Algorithm 3.1 Forward warping algorithm for transforming an image f(x) into an image
g(x') through the parametric transform x' = h(x).

Figure 3.46 Forward warping algorithm: (a) a pixel f(x) is copied to its corresponding
location x' = h(x) in image g(x'); (b) detail of the source and destination pixel locations.


In fact, this approach suffers from several limitations. The process of copying a pixel 
f(x) to a location x' in g is not well defined when x' has a non-integer value. What do we
do in such a case? What would you do? 

You can round the value of x' to the nearest integer coordinate and copy the pixel there, 
but the resulting image has severe aliasing and pixels that jump around a lot when animating 
the transformation. You can also “distribute” the value among its four nearest neighbors in 
a weighted (bilinear) fashion, keeping track of the per-pixel weights and normalizing at the 
end. This technique is called splatting and is sometimes used for volume rendering in the 
graphics community (Levoy and Whitted 1985; Levoy 1988; Westover 1989; Rusinkiewicz 
and Levoy 2000). Unfortunately, it suffers from both moderate amounts of aliasing and a 
fair amount of blur (loss of high-resolution detail). 

The second major problem with forward warping is the appearance of cracks and holes, 
especially when magnifying an image. Filling such holes with their nearby neighbors can 
lead to further aliasing and blurring. 
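The hole problem is easy to reproduce. The following one-dimensional sketch implements Algorithm 3.1 with nearest-integer rounding; the function name and the use of `None` to mark unwritten pixels are choices made for this illustration.

```python
def forward_warp_nearest(f, h, out_width):
    """Forward-warp a 1D image f through x' = h(x), rounding to the nearest pixel."""
    g = [None] * out_width  # None marks destination pixels that were never written
    for x, value in enumerate(f):
        xp = round(h(x))
        if 0 <= xp < out_width:
            g[xp] = value
    return g


f = [1, 2, 3, 4]
g = forward_warp_nearest(f, lambda x: 2 * x, 8)  # 2x magnification
holes = [i for i, v in enumerate(g) if v is None]
```

Under a 2× magnification, every other destination pixel receives no source value, so `holes` lists all of the odd indices.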

What can we do instead? A preferable solution is to use inverse warping (Algorithm 3.2), 
where each pixel in the destination image g(x') is sampled from the original image f(x)
(Figure 3.47). 

How does this differ from the forward warping algorithm? For one thing, since the inverse
mapping ĥ(x') is (presumably) defined for all pixels in g(x'), we no longer have holes. More importantly,
resampling an image at non-integer locations is a well-studied problem (general image
interpolation, see Section 3.5.2) and high-quality filters that control aliasing can be used.

Where does the inverse mapping ĥ(x') come from? Quite often, it can simply be computed as the
inverse of h(x). In fact, all of the parametric transforms listed in Table 3.5 have closed form
solutions for the inverse transform: simply take the inverse of the 3 × 3 matrix specifying the
transform. 
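A minimal sketch of inverse warping for a transform given as a 3 × 3 matrix is shown below. It inverts the matrix in closed form and resamples the source bilinearly. The function names, the zero fill for destination pixels that map outside the source, and the requirement that the source be at least 2 × 2 pixels are assumptions of this example.

```python
def invert3x3(m):
    """Closed-form inverse of a 3x3 matrix via its adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[v / det for v in row] for row in adj]


def bilinear(img, x, y):
    """Sample img (a list of rows) at a non-integer location; 0.0 outside the image."""
    height, width = len(img), len(img[0])
    if not (0 <= x <= width - 1 and 0 <= y <= height - 1):
        return 0.0
    x0, y0 = min(int(x), width - 2), min(int(y), height - 2)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * img[y0][x0] + ax * img[y0][x0 + 1]
    bottom = (1 - ax) * img[y0 + 1][x0] + ax * img[y0 + 1][x0 + 1]
    return (1 - ay) * top + ay * bottom


def inverse_warp(f, H, out_width, out_height):
    """For every destination pixel, look up its source location through H^-1."""
    Hinv = invert3x3(H)
    g = []
    for yp in range(out_height):
        row = []
        for xp in range(out_width):
            u = Hinv[0][0] * xp + Hinv[0][1] * yp + Hinv[0][2]
            v = Hinv[1][0] * xp + Hinv[1][1] * yp + Hinv[1][2]
            w = Hinv[2][0] * xp + Hinv[2][1] * yp + Hinv[2][2]
            row.append(bilinear(f, u / w, v / w))
        g.append(row)
    return g


source = [[0.0, 10.0], [20.0, 30.0]]
magnify = [[2, 0, 0], [0, 2, 0], [0, 0, 1]]
warped = inverse_warp(source, magnify, 3, 3)
```

Unlike the forward warp, every destination pixel in `warped` receives a value, and the in-between pixels are interpolated from their source neighbors.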

In other cases, it is preferable to formulate the problem of image warping as that of
resampling a source image f(x) given a mapping x = ĥ(x') from destination pixels x' to
source pixels x. For example, in optical flow (Section 8.4), we estimate the flow field as the
location of the source pixel which produced the current pixel whose flow is being estimated,
as opposed to computing the destination pixel to which it is going. Similarly, when correcting
for radial distortion (Section 2.1.6), we calibrate the lens by computing for each pixel in the
final (undistorted) image the corresponding pixel location in the original (distorted) image.

procedure inverseWarp(f, h, out g):

For every pixel x' in g(x')

1. Compute the source location x = ĥ(x').

2. Resample f(x) at location x and copy to g(x').

Algorithm 3.2 Inverse warping algorithm for creating an image g(x') from an image f(x)
using the parametric transform x' = h(x).

Figure 3.47 Inverse warping algorithm: (a) a pixel g(x') is sampled from its corresponding
location x = ĥ(x') in image f(x); (b) detail of the source and destination pixel locations.

What kinds of interpolation filter are suitable for the resampling process? Any of the
filters we studied in Section 3.5.2 can be used, including nearest neighbor, bilinear, bicubic, and
windowed sinc functions. While bilinear is often used for speed (e.g., inside the inner loop
of a patch-tracking algorithm, see Section 8.1.3), bicubic and windowed sinc are preferable
where visual quality is important.

To compute the value of f(x) at a non-integer location x, we simply apply our usual FIR 
resampling filter, 

$$g(x, y) = \sum_{k,l} f(k, l)\, h(x - k, y - l), \tag{3.89}$$

where (x, y) are the sub-pixel coordinate values and h(x, y) is some interpolating or smoothing
kernel. Recall from Section 3.5.2 that when decimation is being performed, the smoothing
kernel is stretched and re-scaled according to the downsampling rate r. 

Unfortunately, for a general (non-zoom) image transformation, the resampling rate r is 
not well defined. Consider a transformation that stretches the x dimensions while squashing
the y dimensions. The resampling kernel should be performing regular interpolation along
the x dimension and smoothing (to anti-alias the blurred image) in the y direction. This gets
even more complicated for the case of general affine or perspective transforms.

Figure 3.48 Anisotropic texture filtering: (a) Jacobian of transform A and the induced
horizontal and vertical resampling rates $\{a_{x'x}, a_{x'y}, a_{y'x}, a_{y'y}\}$; (b) elliptical footprint of an
EWA smoothing kernel; (c) anisotropic filtering using multiple samples along the major axis.
Image pixels lie at line intersections.

What can we do? Fortunately, Fourier analysis can help. The two-dimensional general- 
ization of the one-dimensional domain scaling law given in Table 3.1 is 

$$g(Ax) \Leftrightarrow |A|^{-1} G(A^{-T} f). \tag{3.90}$$

For all of the transforms in Table 3.5 except perspective, the matrix A is already defined. 
For perspective transformations, the matrix A is the linearized derivative of the perspective 
transformation (Figure 3.48a), i.e., the local affine approximation to the stretching induced 
by the projection (Heckbert 1986; Wolberg 1990; Gomes, Darsa, Costa et al. 1999; Akenine- 
Moller and Haines 2002). 
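The local affine approximation of a perspective transform can be computed directly from the 3 × 3 matrix using the quotient rule. The sketch below is one way to do this; the function names are hypothetical, and H is taken to be whichever mapping (forward or inverse) is being linearized.

```python
def warp_point(H, x, y):
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w


def homography_jacobian(H, x, y):
    """2x2 matrix of partial derivatives of the warped position at (x, y)."""
    xp, yp = warp_point(H, x, y)
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    # Quotient rule: d(u/w)/dx = (du/dx - (u/w) * dw/dx) / w, and so on.
    return [[(H[0][0] - xp * H[2][0]) / w, (H[0][1] - xp * H[2][1]) / w],
            [(H[1][0] - yp * H[2][0]) / w, (H[1][1] - yp * H[2][1]) / w]]


H = [[1.0, 0.2, 3.0], [0.1, 0.9, -1.0], [0.001, 0.002, 1.0]]
A_local = homography_jacobian(H, 10.0, 20.0)
```

For an affine matrix (last row [0 0 1]) the result is simply the upper-left 2 × 2 block, independent of position; for a true perspective transform it varies from pixel to pixel.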

To prevent aliasing, we need to pre-filter the image f(x) with a filter whose frequency 
response is the projection of the final desired spectrum through the $A^{-T}$ transform (Szeliski,
Winder, and Uyttendaele 2010). In general (for non-zoom transforms), this filter is
non-separable and hence is very slow to compute. Therefore, a number of approximations to this
filter are used in practice, including MIP-mapping, elliptically weighted Gaussian averaging,
and anisotropic filtering (Akenine-Moller and Haines 2002).

MIP-mapping 

MIP-mapping was first proposed by Williams (1983) as a means to rapidly pre-filter images 
being used for texture mapping in computer graphics. A MIP-map 18 is a standard image
pyramid (Figure 3.32), where each level is pre-filtered with a high-quality filter rather than
a poorer quality approximation, such as Burt and Adelson’s (1983b) five-tap binomial. To
resample an image from a MIP-map, a scalar estimate of the resampling rate r is first
computed. For example, r can be the maximum of the absolute values in A (which suppresses
aliasing) or it can be the minimum (which reduces blurring). Akenine-Moller and Haines
(2002) discuss these issues in more detail.

18 The term ‘MIP’ stands for multum in parvo, meaning ‘much in little’.

Once a resampling rate has been specified, a fractional pyramid level is computed using 
the base 2 logarithm, 

$$l = \log_2 r. \tag{3.91}$$

One simple solution is to resample the texture from the next higher or lower pyramid level, 
depending on whether it is preferable to reduce aliasing or blur. A better solution is to re- 
sample both images and blend them linearly using the fractional component of l. Since most 
MIP-map implementations use bilinear resampling within each level, this approach is usually
called trilinear MIP-mapping. Computer graphics rendering APIs, such as OpenGL and
Direct3D, have parameters that can be used to select which variant of MIP-mapping (and of 
the sampling rate r computation) should be used, depending on the desired tradeoff between 
speed and quality. Exercise 3.22 has you examine some of these tradeoffs in more detail. 
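The level selection and blending step can be sketched as follows. The clamping policy (rates below 1 use the finest level, very large rates use the coarsest) and the `sample_level` callback, which stands for a bilinear lookup inside one pyramid level, are assumptions of this example rather than the behavior of any particular graphics API.

```python
import math


def mip_level(r, num_levels):
    """Return (lower level, upper level, blend fraction) for a resampling rate r."""
    l = math.log2(max(r, 1.0))   # magnification (r <= 1) stays on the finest level
    l = min(l, num_levels - 1)   # clamp to the coarsest available level
    lower = int(math.floor(l))
    upper = min(lower + 1, num_levels - 1)
    return lower, upper, l - lower


def trilinear_sample(sample_level, r, num_levels, x, y):
    """Blend samples from the two pyramid levels that bracket l = log2(r)."""
    lower, upper, frac = mip_level(r, num_levels)
    # Coordinates shrink by a factor of two per pyramid level.
    a = sample_level(lower, x / 2 ** lower, y / 2 ** lower)
    b = sample_level(upper, x / 2 ** upper, y / 2 ** upper)
    return (1 - frac) * a + frac * b


levels_for_rate_3 = mip_level(3.0, 5)  # falls between levels 1 and 2
```

Picking only `lower` or only `upper` instead of blending corresponds to the simpler variants mentioned above, which trade blur against aliasing.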

Elliptical Weighted Average 

The Elliptical Weighted Average (EWA) filter invented by Greene and Heckbert (1986) is 
based on the observation that the affine mapping x = Ax' defines a skewed two-dimensional 
coordinate system in the vicinity of each source pixel x (Figure 3.48a). For every destination
pixel x', the ellipsoidal projection of a small pixel grid in x' onto x is computed
(Figure 3.48b). This is then used to filter the source image f(x) with a Gaussian whose inverse
covariance matrix is this ellipsoid. 
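One way to make this concrete is to treat the destination pixel footprint as a unit circle, so that its image under x = Ax' is the ellipse d^T (A A^T)^{-1} d = 1 in source coordinates, and to use A A^T as the Gaussian covariance. That modeling choice and the function name are assumptions of this sketch; the original EWA formulation differs in its details.

```python
import math


def ewa_weight(A, dx, dy):
    """Gaussian weight for a source sample offset (dx, dy) from the footprint center."""
    (a, b), (c, d) = A
    # Covariance of the footprint in source coordinates: Sigma = A A^T.
    s00 = a * a + b * b
    s01 = a * c + b * d
    s11 = c * c + d * d
    det = s00 * s11 - s01 * s01
    # Quadratic form d^T Sigma^-1 d, using the closed-form 2x2 inverse.
    q = (s11 * dx * dx - 2 * s01 * dx * dy + s00 * dy * dy) / det
    return math.exp(-0.5 * q)


A = [[4.0, 0.0], [0.0, 1.0]]   # each destination pixel covers 4 source pixels in x
w_x = ewa_weight(A, 4.0, 0.0)  # on the ellipse boundary along the major axis
w_y = ewa_weight(A, 0.0, 1.0)  # on the ellipse boundary along the minor axis
```

Both offsets lie on the same ellipse, so they receive the same weight, which shows how the kernel stretches along the direction in which the image is being minified.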

Despite its reputation as a high-quality filter (Akenine-Moller and Haines 2002), we have 
found in our work (Szeliski, Winder, and Uyttendaele 2010) that because a Gaussian kernel 
is used, the technique suffers simultaneously from both blurring and aliasing, compared to 
higher-quality filters. The EWA is also quite slow, although faster variants based on MIP- 
mapping have been proposed (Szeliski, Winder, and Uyttendaele (2010) provide some addi- 
tional references). 

Anisotropic filtering 

An alternative approach to filtering oriented textures, which is sometimes implemented in 
graphics hardware (GPUs), is to use anisotropic filtering (Barkans 1997; Akenine-Moller and 
Haines 2002). In this approach, several samples at different resolutions (fractional levels in 
the MIP-map) are combined along the major axis of the EWA Gaussian (Figure 3.48c). 
Figure 3.49 One-dimensional signal resampling (Szeliski, Winder, and Uyttendaele 2010):
(a) original sampled signal $f(i)$; (b) interpolated signal $g_1(x)$; (c) warped signal $g_2(x)$; (d)
filtered signal $g_3(x)$; (e) sampled signal $f'(i)$. The corresponding spectra are shown below
the signals, with the aliased portions shown in red.


Multi-pass transforms 

The optimal approach to warping images without excessive blurring or aliasing is to adap- 
tively pre-filter the source image at each pixel using an ideal low-pass filter, i.e., an oriented 
skewed sinc or low-order (e.g., cubic) approximation (Figure 3.48a). Figure 3.49 shows how
this works in one dimension. The signal is first (theoretically) interpolated to a continuous 
waveform, (ideally) low-pass filtered to below the new Nyquist rate, and then re-sampled to 
the final desired resolution. In practice, the interpolation and decimation steps are concate- 
nated into a single polyphase digital filtering operation (Szeliski, Winder, and Uyttendaele 
2010). 

For parametric transforms, the oriented two-dimensional filtering and resampling
operations can be approximated using a series of one-dimensional resampling and shearing
transforms (Catmull and Smith 1980; Heckbert 1989; Wolberg 1990; Gomes, Darsa, Costa et al.
1999; Szeliski, Winder, and Uyttendaele 2010). The advantage of using a series of one- 
dimensional transforms is that they are much more efficient (in terms of basic arithmetic 
operations) than large, non-separable, two-dimensional filter kernels. 
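The idea behind such multi-pass schemes can be checked numerically: a 2 × 2 affine matrix factors into a pass that only changes x coordinates followed by a pass that only changes y coordinates. The sketch below shows this two-pass factorization only; it omits the extra upsampling stages of the four-pass method, assumes the top-left entry of A is non-zero, and uses hypothetical function names.

```python
import math


def two_pass_factors(A):
    """Factor A into a horizontal pass followed by a vertical pass (A = V * Hz)."""
    (a00, a01), (a10, a11) = A
    if abs(a00) < 1e-12:
        raise ValueError("this simple factorization needs a non-zero A[0][0]")
    horizontal = [[a00, a01], [0.0, 1.0]]   # rows are resampled and sheared in x
    vertical = [[1.0, 0.0], [a10 / a00, a11 - a10 * a01 / a00]]  # then columns in y
    return horizontal, vertical


def matmul2(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


theta = math.radians(30)
rotation = [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]
horizontal, vertical = two_pass_factors(rotation)
recombined = matmul2(vertical, horizontal)  # horizontal pass is applied first
```

Each factor touches only one coordinate, so each pass can be implemented with one-dimensional resampling filters along rows or columns.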

In order to prevent aliasing, however, it may be necessary to upsample in the opposite di- 
rection before applying a shearing transformation (Szeliski, Winder, and Uyttendaele 2010). 
Figure 3.50 shows this process for a rotation, where a vertical upsampling stage is added be- 
fore the horizontal shearing (and upsampling) stage. The upper image shows the appearance 
of the letter being rotated, while the lower image shows its corresponding Fourier transform. 
Figure 3.50 Four-pass rotation (Szeliski, Winder, and Uyttendaele 2010): (a) original pixel
grid, image, and its Fourier transform; (b) vertical upsampling; (c) horizontal shear and
upsampling; (d) vertical shear and downsampling; (e) horizontal downsampling. The general
affine case looks similar except that the first two stages perform general resampling.


3.6.2 Mesh-based warping 

While parametric transforms specified by a small number of global parameters have many 
uses, local deformations with more degrees of freedom are often required. 

Consider, for example, changing the appearance of a face from a frown to a smile (Fig- 
ure 3.51a). What is needed in this case is to curve the corners of the mouth upwards while 
leaving the rest of the face intact. 19 To perform such a transformation, different amounts of 
motion are required in different parts of the image. Figure 3.51 shows some of the commonly 
used approaches. 

The first approach, shown in Figure 3.51a-b, is to specify a sparse set of corresponding 
points. The displacement of these points can then be interpolated to a dense displacement field 
(Chapter 8) using a variety of techniques (Nielson 1993). One possibility is to triangulate 
the set of points in one image (de Berg, Cheong, van Kreveld et al. 2006; Litwinowicz and 
Williams 1994; Buck, Finkelstein, Jacobs et al. 2000) and to use an affine motion model 
(Table 3.5), specified by the three triangle vertices, inside each triangle. If the destination
image is triangulated according to the new vertex locations, an inverse warping algorithm
(Figure 3.47) can be used. If the source image is triangulated and used as a texture map,
computer graphics rendering algorithms can be used to draw the new image (but care must
be taken along triangle edges to avoid potential aliasing).

19 Rowland and Perrett (1995); Pighin, Hecker, Lischinski et al. (1998); Blanz and Vetter (1999); Leyvand,
Cohen-Or, Dror et al. (2008) show more sophisticated examples of changing facial expression and appearance.

Figure 3.51 Image warping alternatives (Gomes, Darsa, Costa et al. 1999) © 1999 Morgan
Kaufmann: (a) sparse control points → deformation grid; (b) denser set of control point
correspondences; (c) oriented line correspondences; (d) uniform quadrilateral grid.
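For the triangulated case, the per-triangle affine map does not need to be solved for explicitly: the barycentric coordinates of a destination point with respect to the destination triangle, applied to the source triangle's vertices, give the same result. The sketch below uses that equivalence for the inverse-warping lookup; the function names and vertex ordering convention are assumptions.

```python
def barycentric(p, tri):
    """Barycentric coordinates of point p with respect to triangle tri (three vertices)."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    l1 = ((y2 - y3) * (p[0] - x3) + (x3 - x2) * (p[1] - y3)) / det
    l2 = ((y3 - y1) * (p[0] - x3) + (x1 - x3) * (p[1] - y3)) / det
    return l1, l2, 1.0 - l1 - l2


def dest_to_source(p, dst_tri, src_tri):
    """Map a destination point into the source image via the triangle's affine motion."""
    weights = barycentric(p, dst_tri)
    x = sum(w * v[0] for w, v in zip(weights, src_tri))
    y = sum(w * v[1] for w, v in zip(weights, src_tri))
    return x, y


def inside(p, tri):
    return all(w >= 0.0 for w in barycentric(p, tri))


dst_tri = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
src_tri = [(1.0, 1.0), (3.0, 1.0), (1.0, 5.0)]
src_point = dest_to_source((1.0, 1.0), dst_tri, src_tri)
```

In a full warp, each destination pixel would first be assigned to the triangle that contains it (for example with `inside`), and the source image would then be resampled at the returned location.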

Alternative methods for interpolating a sparse set of displacements include moving nearby 
quadrilateral mesh vertices, as shown in Figure 3.51a, using variational (energy minimizing) 
interpolants such as regularization (Litwinowicz and Williams 1994), see Section 3.7.1, or 
using locally weighted (radial basis function) combinations of displacements (Nielson 1993).
(See Section 12.3.1 for additional scattered data interpolation techniques.) If quadrilateral
meshes are used, it may be desirable to interpolate displacements down to individual pixel 
values using a smooth interpolant such as a quadratic B-spline (Farin 1996; Lee, Wolberg, 
Chwa et al. 1996). 20 

In some cases, e.g., if a dense depth map has been estimated for an image (Shade, Gortler, 
He et al. 1998), we only know the forward displacement for each pixel. As mentioned before, 
drawing source pixels at their destination location, i.e., forward warping (Figure 3.46), suffers 
from several potential problems, including aliasing and the appearance of small cracks. An 
alternative technique in this case is to forward warp the displacement field (or depth map) to
its new location, fill small holes in the resulting map, and then use inverse warping to perform
the resampling (Shade, Gortler, He et al. 1998). The reason that this generally works better
than forward warping is that displacement fields tend to be much smoother than images, so
the aliasing introduced during the forward warping of the displacement field is much less
noticeable.

20 Note that the block-based motion models used by many video compression standards (Le Gall 1991) can be
thought of as a 0th-order (piecewise-constant) displacement field.

Figure 3.52 Line-based image warping (Beier and Neely 1992) © 1992 ACM: (a) distance
computation and position transfer; (b) rendering algorithm; (c) two intermediate warps used
for morphing.

A second approach to specifying displacements for local deformations is to use corresponding
oriented line segments (Beier and Neely 1992), as shown in Figures 3.51c and 3.52.
Pixels along each line segment are transferred from source to destination exactly as specified,
and other pixels are warped using a smooth interpolation of these displacements. Each line
segment correspondence specifies a translation, rotation, and scaling, i.e., a similarity transform
(Table 3.5), for pixels in its vicinity, as shown in Figure 3.52a. Line segments influence
the overall displacement of the image using a weighting function that depends on the minimum
distance to the line segment (v in Figure 3.52a if u ∈ [0, 1], else the shorter of the two
distances to P and Q).

For each pixel X, the target location X' for each line correspondence is computed along
with a weight that depends on the distance and the line segment length (Figure 3.52b). The
weighted average of all target locations X' then becomes the final destination location. Note
that while Beier and Neely describe this algorithm as a forward warp, an equivalent algorithm
can be written by sequencing through the destination pixels. The resulting warps are not
identical because line lengths or distances to lines may be different. Exercise 3.23 has you
implement the Beier-Neely (line-based) warp and compare it to a number of other local
deformation methods.
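
The per-pixel computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name warp_point and the default constants a, b, and p are assumptions chosen for the example, and the weight (length^p / (a + dist))^b follows the form given by Beier and Neely (1992). As in their paper, v is kept as an absolute perpendicular distance rather than being rescaled by the ratio of line lengths.

```python
import math

def warp_point(X, src_lines, dst_lines, a=1.0, b=2.0, p=0.5):
    """Move point X according to corresponding line segments.

    src_lines and dst_lines are equal-length lists of ((Px, Py), (Qx, Qy)).
    a, b, p are weighting constants (illustrative defaults, not from the text).
    """
    sum_dx = sum_dy = sum_w = 0.0
    for (P, Q), (P2, Q2) in zip(src_lines, dst_lines):
        # Coordinates of X relative to line PQ: u runs along the line
        # (0 at P, 1 at Q), v is the signed perpendicular distance.
        dx, dy = Q[0] - P[0], Q[1] - P[1]
        length = math.hypot(dx, dy)
        px, py = X[0] - P[0], X[1] - P[1]
        u = (px * dx + py * dy) / (length * length)
        v = (px * -dy + py * dx) / length
        # Transfer (u, v) to the corresponding line P2 Q2.
        dx2, dy2 = Q2[0] - P2[0], Q2[1] - P2[1]
        length2 = math.hypot(dx2, dy2)
        tx = P2[0] + u * dx2 + v * -dy2 / length2
        ty = P2[1] + u * dy2 + v * dx2 / length2
        # Minimum distance from X to the segment PQ.
        if u < 0.0:
            dist = math.hypot(px, py)
        elif u > 1.0:
            dist = math.hypot(X[0] - Q[0], X[1] - Q[1])
        else:
            dist = abs(v)
        # Longer and closer lines get more influence.
        w = (length ** p / (a + dist)) ** b
        sum_dx += w * (tx - X[0])
        sum_dy += w * (ty - X[1])
        sum_w += w
    # Weighted average of the displacements proposed by each line.
    return (X[0] + sum_dx / sum_w, X[1] + sum_dy / sum_w)
```

With a single line pair, every point simply follows that line's transform; with several pairs, each point is pulled most strongly by the nearest and longest lines.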

Yet another way of specifying correspondences in order to create image warps is to use
snakes (Section 5.1.1) combined with B-splines (Lee, Wolberg, Chwa et al. 1996). This technique
is used in Apple’s Shake software and is popular in the medical imaging community.

Figure 3.53 Image morphing (Gomes, Darsa, Costa et al. 1999) © 1999 Morgan Kaufmann.
Top row: if the two images are just blended, visible ghosting results. Bottom row: both 
images are first warped to the same intermediate location (e.g., halfway towards the other 
image) and the resulting warped images are then blended resulting in a seamless morph. 


One final possibility for specifying displacement fields is to use a mesh specifically 
adapted to the underlying image content, as shown in Figure 3.51d. Specifying such meshes
by hand can involve a fair amount of work; Gomes, Darsa, Costa et al. (1999) describe an 
interactive system for doing this. Once the two meshes have been specified, intermediate 
warps can be generated using linear interpolation and the displacements at mesh nodes can 
be interpolated using splines. 

3.6.3 Application: Feature-based morphing

While warps can be used to change the appearance of or to animate a single image, even 
more powerful effects can be obtained by warping and blending two or more images using a 
process now commonly known as morphing (Beier and Neely 1992; Lee, Wolberg, Chwa et 
al. 1996; Gomes, Darsa, Costa et al. 1999). 

Figure 3.53 shows the essence of image morphing. Instead of simply cross-dissolving 
between two images, which leads to ghosting as shown in the top row, each image is warped 
toward the other image before blending, as shown in the bottom row. If the correspondences 
have been set up well (using any of the techniques shown in Figure 3.51), corresponding 
features are aligned and no ghosting results. 

The above process is repeated for each intermediate frame being generated during a
morph, using different blends (and amounts of deformation) at each interval. Let t ∈ [0, 1] be
the time parameter that describes the sequence of interpolated frames. The weighting functions
for the two warped images in the blend go as (1 − t) and t. Conversely, the amount of
motion that image 0 undergoes at time t is t of the total amount of motion that is specified
by the correspondences. However, some care must be taken in defining what it means to partially
warp an image towards a destination, especially if the desired motion is far from linear
(Sederberg, Gao, Wang et al. 1993). Exercise 3.25 has you implement a morphing algorithm
and test it out under such challenging conditions.
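
To make the roles of t and (1 − t) concrete, here is a minimal one-dimensional sketch. It assumes the simplest possible correspondence, a single uniform translation (the hypothetical parameter shift) that carries features of image 0 onto image 1; the function names are illustrative, and a real morph would use a spatially varying displacement field.

```python
def sample(img, x):
    """Linearly interpolate a 1D image at real-valued position x (clamped)."""
    x = min(max(x, 0.0), len(img) - 1.0)
    i = min(int(x), len(img) - 2)
    frac = x - i
    return (1.0 - frac) * img[i] + frac * img[i + 1]

def morph_frame(img0, img1, shift, t):
    """Frame at time t in [0, 1]; shift carries features of img0 onto img1."""
    frame = []
    for x in range(len(img0)):
        # Inverse warping: image 0 has moved by t * shift, while image 1
        # has moved back toward image 0 by (1 - t) * shift.
        warped0 = sample(img0, x - t * shift)
        warped1 = sample(img1, x + (1.0 - t) * shift)
        # Blend the two aligned images with weights (1 - t) and t.
        frame.append((1.0 - t) * warped0 + t * warped1)
    return frame

def cross_dissolve(img0, img1, t):
    """Plain blend without warping, for comparison (produces ghosting)."""
    return [(1.0 - t) * a + t * b for a, b in zip(img0, img1)]

img0 = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # feature at x = 2
img1 = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]  # same feature at x = 6
halfway_morph = morph_frame(img0, img1, 4.0, 0.5)
halfway_blend = cross_dissolve(img0, img1, 0.5)
```

In the halfway morph the feature appears once, at full strength, midway between its two positions; the cross-dissolve instead shows two half-strength copies, which is the ghosting visible in the top row of Figure 3.53.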

3.7 Global optimization 

So far in this chapter, we have covered a large number of image processing operators that 
take as input one or more images and produce some filtered or transformed version of these 
images. In many applications, it is more useful to first formulate the goals of the desired 
transformation using some optimization criterion and then find or infer the solution that best 
meets this criterion. 

In this final section, we present two different (but closely related) variants on this idea. 
The first, which is often called regularization or variational methods (Section 3.7.1), con- 
structs a continuous global energy function that describes the desired characteristics of the 
solution and then finds a minimum energy solution using sparse linear systems or related 
iterative techniques. The second formulates the problem using Bayesian statistics, model- 
ing both the noisy measurement process that produced the input images as well as prior 
assumptions about the solution space, which are often encoded using a Markov random field 
(Section 3.7.2). 

Examples of such problems include surface interpolation from scattered data (Figure 3.54), 
image denoising and the restoration of missing regions (Figure 3.57), and the segmentation 
of images into foreground and background regions (Figure 3.61). 

3.7.1 Regularization 

The theory of regularization was first developed by statisticians trying to fit models to data 
that severely underconstrained the solution space (Tikhonov and Arsenin 1977; Engl, Hanke, 
and Neubauer 1996). Consider, for example, finding a smooth surface that passes through 
(or near) a set of measured data points (Figure 3.54). Such a problem is described as ill- 
posed because many possible surfaces can fit this data. Since small changes in the input can 
sometimes lead to large changes in the fit (e.g., if we use polynomial interpolation), such 
problems are also often ill-conditioned. Since we are trying to recover the unknown function
f(x, y) from which the data points d(x_i, y_i) were sampled, such problems are also often called
inverse problems. Many computer vision tasks can be viewed as inverse problems, since we
are trying to recover a full description of the 3D world from a limited set of images.

Figure 3.54 A simple surface interpolation problem: (a) nine data points of various height
scattered on a grid; (b) second-order, controlled-continuity, thin-plate spline interpolator, with
a tear along its left edge and a crease along its right (Szeliski 1989) © 1989 Springer.

In order to quantify what it means to find a smooth solution, we can define a norm on
the solution space. For one-dimensional functions f(x), we can integrate the squared first
derivative of the function,

    \mathcal{E}_1 = \int f_x^2(x) \, dx    (3.92)

or perhaps integrate the squared second derivative,

    \mathcal{E}_2 = \int f_{xx}^2(x) \, dx.    (3.93)

(Here, we use subscripts to denote differentiation.) Such energy measures are examples of
functionals, which are operators that map functions to scalar values. They are also often called
variational methods, because they measure the variation (non-smoothness) in a function.

In two dimensions (e.g., for images, flow fields, or surfaces), the corresponding smoothness
functionals are

    \mathcal{E}_1 = \int f_x^2(x, y) + f_y^2(x, y) \, dx \, dy = \int \| \nabla f(x, y) \|^2 \, dx \, dy    (3.94)

and

    \mathcal{E}_2 = \int f_{xx}^2(x, y) + 2 f_{xy}^2(x, y) + f_{yy}^2(x, y) \, dx \, dy,    (3.95)

where the mixed 2 f_{xy}^2 term is needed to make the measure rotationally invariant (Grimson
1983).

The first derivative norm is often called the membrane, since interpolating a set of data
points using this measure results in a tent-like structure. (In fact, this formula is a small-
deflection approximation to the surface area, which is what soap bubbles minimize.) The
second-order norm is called the thin-plate spline, since it approximates the behavior of thin
plates (e.g., flexible steel) under small deformations. A blend of the two is called the thin- 
plate spline under tension; versions of these formulas where each derivative term is mul- 
tiplied by a local weighting function are called controlled-continuity splines (Terzopoulos 
1988). Figure 3.54 shows a simple example of a controlled-continuity interpolator fit to nine 
scattered data points. In practice, it is more common to find first-order smoothness terms 
used with images and flow fields (Section 8.4) and second-order smoothness associated with 
surfaces (Section 12.3.1). 

In addition to the smoothness term, regularization also requires a data term (or data
penalty). For scattered data interpolation (Nielson 1993), the data term measures the distance
between the function f(x, y) and a set of data points d_i = d(x_i, y_i),

    \mathcal{E}_d = \sum_i [f(x_i, y_i) - d_i]^2.    (3.96)

For a problem like noise removal, a continuous version of this measure can be used,

    \mathcal{E}_d = \int [f(x, y) - d(x, y)]^2 \, dx \, dy.    (3.97)

To obtain a global energy that can be minimized, the two energy terms are usually added
together,

    \mathcal{E} = \mathcal{E}_d + \lambda \mathcal{E}_s,    (3.98)

where ℰ_s is the smoothness penalty (ℰ_1, ℰ_2 or some weighted blend) and λ is the regularization
parameter, which controls how smooth the solution should be.

In order to find the minimum of this continuous problem, the function f(x, y) is usually
first discretized on a regular grid.21 The most principled way to perform this discretization is
to use finite element analysis, i.e., to approximate the function with a piecewise continuous
spline, and then perform the analytic integration (Bathe 2007).

21 The alternative of using kernel basis functions centered on the data points (Boult and Kender 1986; Nielson
1993) is discussed in more detail in Section 12.3.1.

Fortunately, for both the first-order and second-order smoothness functionals, the judicious
selection of appropriate finite elements results in particularly simple discrete forms
(Terzopoulos 1983). The corresponding discrete smoothness energy functions become

    E_1 = \sum_{i,j} s_x(i,j) [f(i+1,j) - f(i,j) - g_x(i,j)]^2
          + s_y(i,j) [f(i,j+1) - f(i,j) - g_y(i,j)]^2    (3.99)

and

    E_2 = h^{-2} \sum_{i,j} c_x(i,j) [f(i+1,j) - 2 f(i,j) + f(i-1,j)]^2
          + 2 c_m(i,j) [f(i+1,j+1) - f(i+1,j) - f(i,j+1) + f(i,j)]^2
          + c_y(i,j) [f(i,j+1) - 2 f(i,j) + f(i,j-1)]^2,    (3.100)

where h is the size of the finite element grid. The h factor is only important if the energy is 
being discretized at a variety of resolutions, as in coarse-to-fine or multigrid techniques. 

The optional smoothness weights s_x(i, j) and s_y(i, j) control the location of horizontal
and vertical tears (or weaknesses) in the surface. For other problems, such as colorization
(Levin, Lischinski, and Weiss 2004) and interactive tone mapping (Lischinski, Farbman,
Uyttendaele et al. 2006a), they control the smoothness in the interpolated chroma or exposure
field and are often set inversely proportional to the local luminance gradient strength.
For second-order problems, the crease variables c_x(i, j), c_m(i, j), and c_y(i, j) control the
locations of creases in the surface (Terzopoulos 1988; Szeliski 1990a).

The data values g_x(i, j) and g_y(i, j) are gradient data terms (constraints) used by algorithms
such as photometric stereo (Section 12.1.1), HDR tone mapping (Section 10.2.1)
(Fattal, Lischinski, and Werman 2002), Poisson blending (Section 9.3.4) (Perez, Gangnet,
and Blake 2003), and gradient-domain blending (Section 9.3.4) (Levin, Zomet, Peleg et al.
2004). They are set to zero when just discretizing the conventional first-order smoothness
functional (3.94).

The two-dimensional discrete data energy is written as

    E_d = \sum_{i,j} w(i,j) [f(i,j) - d(i,j)]^2,    (3.101)

where the local weights w(i, j) control how strongly the data constraint is enforced. These
values are set to zero where there is no data and can be set to the inverse variance of the data
measurements when there is data (as discussed by Szeliski (1989) and in Section 3.7.2).

The total energy of the discretized problem can now be written as a quadratic form

    E = E_d + \lambda E_s = x^T A x - 2 x^T b + c,    (3.102)

where x = [f(0, 0) ... f(m − 1, n − 1)] is called the state vector.22

The sparse symmetric positive-definite matrix A is called the Hessian since it encodes the
second derivative of the energy function.23 For the one-dimensional, first-order problem, A
is tridiagonal; for the two-dimensional, first-order problem, it is multi-banded with five non-zero
entries per row. We call b the weighted data vector. Minimizing the above quadratic
form is equivalent to solving the sparse linear system

    A x = b,    (3.103)

which can be done using a variety of sparse matrix techniques, such as multigrid (Briggs,
Henson, and McCormick 2000) and hierarchical preconditioners (Szeliski 2006b), as described
in Appendix A.5.

22 We use x instead of f because this is the more common form in the numerical analysis literature (Golub and
Van Loan 1996).

23 In numerical analysis, A is called the coefficient matrix (Saad 2003); in finite element analysis (Bathe 2007), it
is called the stiffness matrix.

Figure 3.55 Graphical model interpretation of first-order regularization. The white circles
are the unknowns f(i, j) while the dark circles are the input data d(i, j). In the resistive grid
interpretation, the d and f values encode input and output voltages and the black squares
denote resistors whose conductance is set to s_x(i, j), s_y(i, j), and w(i, j). In the spring-mass
system analogy, the circles denote elevations and the black squares denote springs. The same
graphical model can be used to depict a first-order Markov random field (Figure 3.56).
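
As a small worked illustration of the one-dimensional, first-order case, the sketch below assembles the tridiagonal system for E = Σ w_i (f_i − d_i)^2 + λ Σ s_i (f_{i+1} − f_i)^2 and solves it directly. Setting the gradient of this energy to zero gives diagonal entries w_i + λ(s_{i−1} + s_i), off-diagonal entries −λ s_i, and right-hand side b_i = w_i d_i. The function names are assumptions made for this example, and the direct tridiagonal (Thomas) solver stands in for the multigrid and preconditioned solvers mentioned above, which matter mainly for large two-dimensional problems.

```python
def solve_tridiagonal(off, diag, rhs):
    """Solve a symmetric tridiagonal system; off[i] couples unknowns i and i+1."""
    n = len(diag)
    c, g = [0.0] * n, [0.0] * n
    c[0] = off[0] / diag[0] if n > 1 else 0.0
    g[0] = rhs[0] / diag[0]
    for i in range(1, n):  # forward elimination
        denom = diag[i] - off[i - 1] * c[i - 1]
        c[i] = off[i] / denom if i < n - 1 else 0.0
        g[i] = (rhs[i] - off[i - 1] * g[i - 1]) / denom
    x = [0.0] * n
    x[-1] = g[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = g[i] - c[i] * x[i + 1]
    return x

def regularize_1d(d, w, lam, s=None):
    """Minimize sum w_i (f_i - d_i)^2 + lam * sum s_i (f_{i+1} - f_i)^2."""
    n = len(d)
    s = s if s is not None else [1.0] * (n - 1)
    diag = [float(w[i]) for i in range(n)]
    off = [-lam * s[i] for i in range(n - 1)]
    for i in range(n - 1):
        # Each smoothness term adds to the diagonal of both of its endpoints.
        diag[i] += lam * s[i]
        diag[i + 1] += lam * s[i]
    rhs = [w[i] * d[i] for i in range(n)]  # the weighted data vector b
    return solve_tridiagonal(off, diag, rhs)

# Two data points (w = 1) at the ends, no data (w = 0) in between.
f = regularize_1d([0.0, 0.0, 0.0, 0.0, 4.0], [1.0, 0.0, 0.0, 0.0, 1.0], lam=1.0)
```

The unknowns without data fall on a straight line between the two (slightly smoothed) end values, which is the one-dimensional analogue of the membrane. Setting one of the s_i to zero decouples its two neighbors, which is how a tear is introduced.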

While regularization was first introduced to the vision community by Poggio, Torre, and 
Koch (1985) and Terzopoulos (1986b) for problems such as surface interpolation, it was 
quickly adopted by other vision researchers for such varied problems as edge detection (Sec- 
tion 4.2), optical flow (Section 8.4), and shape from shading (Section 12.1) (Poggio, Torre, 
and Koch 1985; Horn and Brooks 1986; Terzopoulos 1986b; Bertero, Poggio, and Torre 1988; 
Brox, Bruhn, Papenberg et al. 2004). Poggio, Torre, and Koch (1985) also showed how the 
discrete energy defined by Equations (3.100-3.101) could be implemented in a resistive grid, 
as shown in Figure 3.55. In computational photography (Chapter 10), regularization and its 
variants are commonly used to solve problems such as high-dynamic range tone mapping 
(Fattal, Lischinski, and Werman 2002; Lischinski, Farbman, Uyttendaele et al. 2006a), Pois- 
son and gradient-domain blending (Perez, Gangnet, and Blake 2003; Levin, Zomet, Peleg et 
al. 2004; Agarwala, Dontcheva, Agrawala et al. 2004), colorization (Levin, Lischinski, and 
Weiss 2004), and natural image matting (Levin, Lischinski, and Weiss 2008). 

Robust regularization 

While regularization is most commonly formulated using quadratic (L2) norms (compare
with the squared derivatives in (3.92-3.95) and squared differences in (3.100-3.101)), it can
also be formulated using non-quadratic robust penalty functions (Appendix B.3). For example,
(3.100) can be generalized to

    E_{1r} = \sum_{i,j} s_x(i,j) \rho(f(i+1,j) - f(i,j))
             + s_y(i,j) \rho(f(i,j+1) - f(i,j)),    (3.104)

where ρ(x) is some monotonically increasing penalty function. For example, the family of
norms ρ(x) = |x|^p is called p-norms. When p < 2, the resulting smoothness terms become
more piecewise continuous than totally smooth, which can better model the discontinuous
nature of images, flow fields, and 3D surfaces.

An early example of robust regularization is the graduated non-convexity (GNC) algorithm
introduced by Blake and Zisserman (1987). Here, the norms on the data and derivatives
are clamped to a maximum value

    \rho(x) = \min(x^2, V).    (3.105)

Because the resulting problem is highly non-convex (it has many local minima), a continuation
method is proposed, where a quadratic norm (which is convex) is gradually replaced by
the non-convex robust norm (Allgower and Georg 2003). (Around the same time, Terzopoulos
(1988) was also using continuation to infer the tear and crease variables in his surface
interpolation problems.)

Today, it is more common to use the L1 (p = 1) norm, which is often called total variation
(Chan, Osher, and Shen 2001; Tschumperle and Deriche 2005; Tschumperle 2006; Kaftory,
Schechner, and Zeevi 2007). Other norms, for which the influence (derivative) more quickly
decays to zero, are presented by Black and Rangarajan (1996) and Black, Sapiro, Marimont et
al. (1998) and discussed in Appendix B.3.

Even more recently, hyper-Laplacian norms with p < 1 have gained popularity, based
on the observation that the log-likelihood distribution of image derivatives follows a p ≈
0.5–0.8 slope and is therefore a hyper-Laplacian distribution (Simoncelli 1999; Levin and
Weiss 2007; Weiss and Freeman 2007; Krishnan and Fergus 2009). Such norms have an even
stronger tendency to prefer large discontinuities over small ones. See the related discussion
in Section 3.7.2 (3.114).

While least squares regularized problems using L2 norms can be solved using linear systems,
other p-norms require different iterative techniques, such as iteratively reweighted least
squares (IRLS), Levenberg-Marquardt, or alternation between local non-linear subproblems
and global quadratic regularization (Krishnan and Fergus 2009). Such techniques are discussed
in Section 6.1.3 and Appendices A.3 and B.3.
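
The idea behind iteratively reweighted least squares can be sketched in one dimension. The code below is an illustrative sketch rather than any of the cited algorithms: it minimizes Σ w_i (f_i − d_i)^2 + λ Σ |f_{i+1} − f_i|^p by repeatedly replacing each robust term with a quadratic one whose weight, s_i = (p/2)(Δ_i^2 + ε^2)^{(p−2)/2}, depends on the current difference Δ_i. The small constant ε (an assumption of this sketch) keeps the weights finite, and the function names are hypothetical.

```python
def solve_tridiagonal(off, diag, rhs):
    """Solve a symmetric tridiagonal system; off[i] couples unknowns i and i+1."""
    n = len(diag)
    c, g = [0.0] * n, [0.0] * n
    c[0] = off[0] / diag[0] if n > 1 else 0.0
    g[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - off[i - 1] * c[i - 1]
        c[i] = off[i] / denom if i < n - 1 else 0.0
        g[i] = (rhs[i] - off[i - 1] * g[i - 1]) / denom
    x = [0.0] * n
    x[-1] = g[-1]
    for i in range(n - 2, -1, -1):
        x[i] = g[i] - c[i] * x[i + 1]
    return x

def irls_smooth(d, w, lam, p=1.0, iters=30, eps=1e-3):
    """Robust 1D smoothing with a |x|^p smoothness penalty, solved by IRLS."""
    n = len(d)
    f = list(d)
    for _ in range(iters):
        # Quadratic surrogate weights from the current solution
        # (all equal to 1 when p = 2, i.e., ordinary regularization).
        s = [0.5 * p * ((f[i + 1] - f[i]) ** 2 + eps ** 2) ** (0.5 * p - 1.0)
             for i in range(n - 1)]
        diag = [float(w[i]) for i in range(n)]
        off = [-lam * s[i] for i in range(n - 1)]
        for i in range(n - 1):
            diag[i] += lam * s[i]
            diag[i + 1] += lam * s[i]
        rhs = [w[i] * d[i] for i in range(n)]
        f = solve_tridiagonal(off, diag, rhs)  # weighted least squares step
    return f

step = [0.0] * 5 + [1.0] * 5
ones = [1.0] * 10
f_l1 = irls_smooth(step, ones, lam=1.0, p=1.0)           # total variation
f_l2 = irls_smooth(step, ones, lam=1.0, p=2.0, iters=1)  # quadratic
```

On this step edge the p = 1 solution keeps most of the discontinuity, while the quadratic solution spreads it over several samples, which is the piecewise-continuous behavior described above.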
3.7.2 Markov random fields

As we have just seen, regularization, which involves the minimization of energy functionals 
defined over (piecewise) continuous functions, can be used to formulate and solve a variety 
of low-level computer vision problems. An alternative technique is to formulate a Bayesian 
model, which separately models the noisy image formation (measurement) process, as well
as assuming a statistical prior model over the solution space. In this section, we look at 
priors based on Markov random fields, whose log-likelihood can be described using local 
neighborhood interaction (or penalty) terms (Kindermann and Snell 1980; Geman and Geman 
1984; Marroquin, Mitter, and Poggio 1987; Li 1995; Szeliski, Zabih, Scharstein et al. 2008). 

The use of Bayesian modeling has several potential advantages over regularization (see 
also Appendix B). The ability to model measurement processes statistically enables us to 
extract the maximum information possible from each measurement, rather than just guessing 
what weighting to give the data. Similarly, the parameters of the prior distribution can often 
be learned by observing samples from the class we are modeling (Roth and Black 2007a; 
Tappen 2007; Li and Huttenlocher 2008). Furthermore, because our model is probabilistic, 
it is possible to estimate (in principle) complete probability distributions over the unknowns 
being recovered and, in particular, to model the uncertainty in the solution, which can be 
useful in later processing stages. Finally, Markov random field models can be defined over
discrete variables, such as image labels (where the variables have no proper ordering), for 
which regularization does not apply. 

Recall from (3.68) in Section 3.4.3 (or see Appendix B.4) that, according to Bayes’ Rule,
the posterior distribution p(x|y) over the unknowns x, given a set of measurements y with
likelihood p(y|x) and a prior p(x) over the unknowns, is given by

    p(x|y) = \frac{p(y|x) \, p(x)}{p(y)},    (3.106)

where p(y) = ∫_x p(y|x) p(x) is a normalizing constant used to make the p(x|y) distribution
proper (integrate to 1). Taking the negative logarithm of both sides of (3.106), we get

    -\log p(x|y) = -\log p(y|x) - \log p(x) + C,    (3.107)

which is the negative posterior log likelihood.

To find the most likely (maximum a posteriori or MAP) solution x given some measurements
y, we simply minimize this negative log likelihood, which can also be thought of as an
energy,

    E(x, y) = E_d(x, y) + E_p(x).    (3.108)


(We drop the constant C because its value does not matter during energy minimization.) The
first term E_d(x, y) is the data energy or data penalty; it measures the negative log likelihood
that the data were observed given the unknown state x. The second term E_p(x) is the prior
energy; it plays a role analogous to the smoothness energy in regularization. Note that the 
MAP estimate may not always be desirable, since it selects the “peak” in the posterior dis- 
tribution rather than some more stable statistic — see the discussion in Appendix B.2 and by 
Levin, Weiss, Durand et al. (2009). 
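
As a brief worked example of how a likelihood turns into a data energy, suppose (as an assumption made only for this illustration) that each measurement d(i, j) equals the unknown f(i, j) plus independent Gaussian noise with variance σ^2(i, j). Taking the negative logarithm of the likelihood then gives a weighted sum of squared differences:

```latex
% Assumption: independent Gaussian noise with per-pixel variance \sigma^2(i,j)
p(y|x) \propto \prod_{i,j} \exp\left( -\frac{[f(i,j) - d(i,j)]^2}{2 \sigma^2(i,j)} \right)
\quad \Longrightarrow \quad
E_d(x, y) = -\log p(y|x)
          = \sum_{i,j} \frac{1}{2 \sigma^2(i,j)} \, [f(i,j) - d(i,j)]^2 + \mathrm{const}.
```

This has the same form as the discrete data energy (3.101) with weights proportional to the inverse variance of each measurement, which is one way to see why regularization and MAP estimation are so closely related.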

For image processing applications, the unknowns x are the set of output pixels

    x = [f(0, 0) ... f(m − 1, n − 1)],

and the data are (in the simplest case) the input pixels

    y = [d(0, 0) ... d(m − 1, n − 1)]

as shown in Figure 3.56.

For a Markov random field, the probability p(x) is a Gibbs or Boltzmann distribution,
whose negative log likelihood (according to the Hammersley-Clifford theorem) can be written
as a sum of pairwise interaction potentials,

    E_p(x) = \sum_{\{(i,j),(k,l)\} \in \mathcal{N}} V_{i,j,k,l}(f(i,j), f(k,l)),    (3.109)

where N(i, j) denotes the neighbors of pixel (i, j). In fact, the general version of the theorem
says that the energy may have to be evaluated over a larger set of cliques, which depend on
the order of the Markov random field (Kindermann and Snell 1980; Geman and Geman 1984;
Bishop 2006; Kohli, Ladicky, and Torr 2009; Kohli, Kumar, and Torr 2009).

The most commonly used neighborhood in Markov random field modeling is the N_4
neighborhood, where each pixel f(i, j) in the field interacts only with its immediate neighbors.
The model in Figure 3.56, which we previously used in Figure 3.55 to illustrate the
discrete version of first-order regularization, shows an N_4 MRF. The s_x(i, j) and s_y(i, j)
black boxes denote arbitrary interaction potentials between adjacent nodes in the random
field and the w(i, j) denote the data penalty functions. These square nodes can also be interpreted
as factors in a factor graph version of the (undirected) graphical model (Bishop 2006),
which is another name for interaction potentials. (Strictly speaking, the factors are (improper)
probability functions whose product is the (un-normalized) posterior distribution.)

As we will see in (3.112-3.113), there is a close relationship between these interaction 
potentials and the discretized versions of regularized image restoration problems. Thus, to 
a first approximation, we can view energy minimization being performed when solving a 
regularized problem and the maximum a posteriori inference being performed in an MRF as 
equivalent. 

While N_4 neighborhoods are most commonly used, in some applications N_8 (or even
higher order) neighborhoods perform better at tasks such as image segmentation because
they can better model discontinuities at different orientations (Boykov and Kolmogorov 2003;
Rother, Kohli, Feng et al. 2009; Kohli, Ladicky, and Torr 2009; Kohli, Kumar, and Torr 2009).

Figure 3.56 Graphical model for an N_4 neighborhood Markov random field. (The blue
edges are added for an N_8 neighborhood.) The white circles are the unknowns f(i, j) while
the dark circles are the input data d(i, j). The s_x(i, j) and s_y(i, j) black boxes denote arbitrary
interaction potentials between adjacent nodes in the random field, and the w(i, j) denote
the data penalty functions. The same graphical model can be used to depict a discrete version
of a first-order regularization problem (Figure 3.55).

Binary MRFs 

The simplest possible example of a Markov random field is a binary field. Examples of such
fields include 1-bit (black and white) scanned document images as well as images segmented
into foreground and background regions.

To denoise a scanned image, we set the data penalty to reflect the agreement between the
scanned and final images,

    E_d(i,j) = w \, \delta(f(i,j), d(i,j))    (3.110)

and the smoothness penalty to reflect the agreement between neighboring pixels

    E_p(i,j) = E_x(i,j) + E_y(i,j) = s \, \delta(f(i,j), f(i+1,j)) + s \, \delta(f(i,j), f(i,j+1)).    (3.111)

Once we have formulated the energy, how do we minimize it? The simplest approach is 
to perform gradient descent, flipping one state at a time if it produces a lower energy. This ap- 
proach is known as contextual classification (Kittler and Foglein 1984), iterated conditional 
modes (ICM) (Besag 1986), or highest confidence first (HCF) (Chou and Brown 1990) if the 
pixel with the largest energy decrease is selected first. 
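
A minimal sketch of the simplest of these downhill schemes, iterated conditional modes with a fixed raster visiting order, is given below. It assumes that δ(a, b) in (3.110-3.111) is read as a disagreement penalty (1 when a ≠ b, 0 otherwise), which the text does not spell out; the function names and the values of w and s are illustrative.

```python
def local_cost(f, d, i, j, label, w, s):
    """Energy terms that involve pixel (i, j) if it takes the given label."""
    rows, cols = len(f), len(f[0])
    cost = w * (label != d[i][j])
    for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
        if 0 <= ni < rows and 0 <= nj < cols:
            cost += s * (label != f[ni][nj])
    return cost

def total_energy(f, d, w, s):
    """Sum of the data and smoothness penalties over the whole image."""
    rows, cols = len(f), len(f[0])
    e = 0.0
    for i in range(rows):
        for j in range(cols):
            e += w * (f[i][j] != d[i][j])
            if i + 1 < rows:
                e += s * (f[i][j] != f[i + 1][j])
            if j + 1 < cols:
                e += s * (f[i][j] != f[i][j + 1])
    return e

def icm(d, w=1.0, s=0.6, max_sweeps=20):
    """Flip one pixel at a time whenever the flip lowers the energy."""
    f = [row[:] for row in d]  # start from the observed image
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(f)):
            for j in range(len(f[0])):
                flipped = 1 - f[i][j]
                if local_cost(f, d, i, j, flipped, w, s) < local_cost(f, d, i, j, f[i][j], w, s):
                    f[i][j] = flipped
                    changed = True
        if not changed:  # no single flip helps: a local minimum
            break
    return f

noisy = [[1] * 5 for _ in range(5)]
noisy[2][2] = 0  # one isolated outlier
cleaned = icm(noisy)
```

Here the isolated pixel is flipped because disagreeing with four neighbors (4s = 2.4) costs more than disagreeing with the data (w = 1). Larger noisy regions can leave this procedure stuck in a poor local minimum.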

Unfortunately, these downhill methods tend to get easily stuck in local minima. An alternative
approach is to add some randomness to the process, which is known as stochastic
gradient descent (Metropolis, Rosenbluth, Rosenbluth et al. 1953; Geman and Geman 1984).
When the amount of noise is decreased over time, this technique is known as simulated an- 
nealing (Kirkpatrick, Gelatt, and Vecchi 1983; Carnevali, Coletti, and Patarnello 1985; Wol- 
berg and Pavlidis 1985; Swendsen and Wang 1987) and was first popularized in computer 
vision by Geman and Geman (1984) and later applied to stereo matching by Barnard (1989), 
among others. 
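
The following sketch shows one simple way such a randomized search could look for the binary denoising energy. It is a generic Metropolis-style annealer under stated assumptions, not the specific algorithm of any of the cited papers: δ is again read as a disagreement penalty, and the starting temperature t0, the geometric cooling rate, the number of sweeps, and the function names are all illustrative choices. The best labeling seen so far is kept, so the result is never worse than the starting image.

```python
import math
import random

def total_energy(f, d, w, s):
    """Sum of the data and smoothness penalties over the whole image."""
    rows, cols = len(f), len(f[0])
    e = 0.0
    for i in range(rows):
        for j in range(cols):
            e += w * (f[i][j] != d[i][j])
            if i + 1 < rows:
                e += s * (f[i][j] != f[i + 1][j])
            if j + 1 < cols:
                e += s * (f[i][j] != f[i][j + 1])
    return e

def flip_delta(f, d, i, j, w, s):
    """Change in energy caused by flipping pixel (i, j)."""
    rows, cols = len(f), len(f[0])
    cur, new = f[i][j], 1 - f[i][j]
    delta = w * ((new != d[i][j]) - (cur != d[i][j]))
    for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
        if 0 <= ni < rows and 0 <= nj < cols:
            delta += s * ((new != f[ni][nj]) - (cur != f[ni][nj]))
    return delta

def anneal(d, w=1.0, s=0.6, t0=2.0, cooling=0.9, sweeps=50, seed=0):
    """Return the best labeling found and its energy change relative to d."""
    rng = random.Random(seed)
    rows, cols = len(d), len(d[0])
    f = [row[:] for row in d]
    best, e, best_e, temp = [row[:] for row in f], 0.0, 0.0, t0
    for _ in range(sweeps):
        for _ in range(rows * cols):
            i, j = rng.randrange(rows), rng.randrange(cols)
            delta = flip_delta(f, d, i, j, w, s)
            # Always accept downhill moves; accept uphill moves with a
            # probability that shrinks as the temperature is lowered.
            if delta <= 0.0 or rng.random() < math.exp(-delta / temp):
                f[i][j] = 1 - f[i][j]
                e += delta
                if e < best_e:
                    best_e, best = e, [row[:] for row in f]
        temp *= cooling  # decrease the amount of noise over time
    return best, best_e

noisy = [[1] * 5 for _ in range(5)]
noisy[2][2] = 0
restored, gain = anneal(noisy)
```

The occasional uphill moves at high temperature are what allow the search to leave local minima that trap purely downhill methods.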

Even this technique, however, does not perform that well (Boykov, Veksler, and Zabih
2001). For binary images, a much better technique, introduced to the computer vision community
by Boykov, Veksler, and Zabih (2001), is to re-formulate the energy minimization as
a max-flow/min-cut graph optimization problem (Greig, Porteous, and Seheult 1989). This
technique has informally come to be known as graph cuts in the computer vision community
(Boykov and Kolmogorov 2010). For simple energy functions, e.g., those where the penalty
for non-identical neighboring pixels is a constant, this algorithm is guaranteed to produce the
global minimum. Kolmogorov and Zabih (2004) formally characterize the class of binary
energy potentials (regularity conditions) for which these results hold, while newer work by
Komodakis, Tziritas, and Paragios (2008) and Rother, Kolmogorov, Lempitsky et al. (2007)
provides good algorithms for the cases when they do not.
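
For the simple case of a constant penalty between non-identical neighbors, the reduction to a minimum cut can be sketched as follows. This is an illustrative construction, with δ read as a disagreement penalty, hypothetical function names, and a basic breadth-first augmenting-path (Edmonds-Karp) max-flow solver standing in for the much faster specialized algorithms used in practice. Pixels that end up on the source side of the cut receive label 0 and the rest receive label 1.

```python
from collections import deque

def source_side_of_min_cut(cap, source, sink):
    """Run max-flow on capacities cap[u][v]; return nodes reachable from source."""
    for u in list(cap):  # make sure every edge has a reverse (residual) edge
        for v in list(cap[u]):
            cap.setdefault(v, {}).setdefault(u, 0.0)
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:  # breadth-first search
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return set(parent)  # no augmenting path left: this is the cut
        flow, v = float("inf"), sink
        while parent[v] is not None:  # bottleneck capacity along the path
            flow = min(flow, cap[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:  # push the flow along the path
            cap[parent[v]][v] -= flow
            cap[v][parent[v]] += flow
            v = parent[v]

def binary_graph_cut(d, w=1.0, s=0.6):
    """Globally minimize the binary denoising energy for the image d."""
    rows, cols = len(d), len(d[0])
    cap = {"s": {}, "t": {}}
    for i in range(rows):
        for j in range(cols):
            node = (i, j)
            cap.setdefault(node, {})
            cap["s"][node] = w * (1 != d[i][j])  # cut if node gets label 1
            cap[node]["t"] = w * (0 != d[i][j])  # cut if node gets label 0
            for ni, nj in ((i + 1, j), (i, j + 1)):
                if ni < rows and nj < cols:  # cut if the two labels differ
                    cap[node][(ni, nj)] = s
                    cap.setdefault((ni, nj), {})[node] = s
    side = source_side_of_min_cut(cap, "s", "t")
    return [[0 if (i, j) in side else 1 for j in range(cols)] for i in range(rows)]

def total_energy(f, d, w, s):
    """Sum of the data and smoothness penalties over the whole image."""
    rows, cols = len(f), len(f[0])
    e = 0.0
    for i in range(rows):
        for j in range(cols):
            e += w * (f[i][j] != d[i][j])
            if i + 1 < rows:
                e += s * (f[i][j] != f[i + 1][j])
            if j + 1 < cols:
                e += s * (f[i][j] != f[i][j + 1])
    return e
```

Every cut of this graph corresponds to a labeling whose cut cost equals its energy, so the minimum cut yields a global minimum. On a small image this can be checked against an exhaustive search over all labelings.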

In addition to the above mentioned techniques, a number of other optimization approaches 
have been developed for MRF energy minimization, such as (loopy) belief propagation and 
dynamic programming (for one-dimensional problems). These are discussed in more detail 
in Appendix B.5 as well as the comparative survey paper by Szeliski, Zabih, Scharstein et al. 
(2008). 
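
For one-dimensional problems, the dynamic programming approach mentioned above finds the exact minimum of a chain-structured energy Σ_i D_i(f_i) + Σ_i V(f_i, f_{i+1}). The sketch below is a generic min-sum (Viterbi-style) recursion; the function name chain_dp and the table layout data_cost[i][label] are assumptions made for this example.

```python
def chain_dp(data_cost, pair_cost):
    """Exact minimum-energy labeling of a chain.

    data_cost[i][l] is the cost of giving site i label l;
    pair_cost(l, m) is the cost of adjacent labels l and m.
    """
    n, num_labels = len(data_cost), len(data_cost[0])
    cost = list(data_cost[0])  # best energy of any prefix ending in each label
    back = []                  # back[i - 1][m]: best predecessor of label m at site i
    for i in range(1, n):
        new_cost, arg = [], []
        for m in range(num_labels):
            best_l = min(range(num_labels), key=lambda l: cost[l] + pair_cost(l, m))
            new_cost.append(cost[best_l] + pair_cost(best_l, m) + data_cost[i][m])
            arg.append(best_l)
        cost = new_cost
        back.append(arg)
    # Trace the optimal labels back from the cheapest final label.
    label = min(range(num_labels), key=lambda l: cost[l])
    total = cost[label]
    labels = [label]
    for arg in reversed(back):
        label = arg[label]
        labels.append(label)
    labels.reverse()
    return labels, total

# Binary 1D denoising with w = 1 and s = 0.6 on a signal with one outlier.
signal = [0, 0, 1, 0, 0]
data = [[1.0 * (label != value) for label in (0, 1)] for value in signal]
labels, energy = chain_dp(data, lambda l, m: 0.6 * (l != m))
```

The cost is linear in the chain length and quadratic in the number of labels, and unlike the downhill methods above the result is guaranteed to be a global minimum for chain-structured problems.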

Ordinal-valued MRFs

In addition to binary images, Markov random fields can be applied to ordinal-valued labels
such as grayscale images or depth maps. The term “ordinal” indicates that the labels have an
implied ordering, e.g., that higher values are lighter pixels. In the next section, we look at
unordered labels, such as source image labels for image compositing.

In many cases, it is common to extend the binary data and smoothness prior terms as

    E_d(i,j) = w(i,j) \rho_d(f(i,j) - d(i,j))    (3.112)

and

    E_p(i,j) = s_x(i,j) \rho_p(f(i,j) - f(i+1,j)) + s_y(i,j) \rho_p(f(i,j) - f(i,j+1)),    (3.113)

which are robust generalizations of the quadratic penalty terms (3.101) and (3.100), first
introduced in (3.105). As before, the w(i, j), s_x(i, j) and s_y(i, j) weights can be used to
locally control the data weighting and the horizontal and vertical smoothness. Instead of
using a quadratic penalty, however, a general monotonically increasing penalty function ρ()
is used. (Different functions can be used for the data and smoothness terms.) For example,
ρ_p can be a hyper-Laplacian penalty

    \rho_p(d) = |d|^p, \quad p < 1,    (3.114)

which better encodes the distribution of gradients (mainly edges) in an image than either a
quadratic or linear (total variation) penalty.24 Levin and Weiss (2007) use such a penalty
to separate a transmitted and reflected image (Figure 8.17) by encouraging gradients to lie in
one or the other image, but not both. More recently, Levin, Fergus, Durand et al. (2007) use
the hyper-Laplacian as a prior for image deconvolution (deblurring) and Krishnan and Fergus
(2009) develop a faster algorithm for solving such problems. For the data penalty, ρ_d can be
quadratic (to model Gaussian noise) or the log of a contaminated Gaussian (Appendix B.3).

Figure 3.57 Grayscale image denoising and inpainting: (a) original image; (b) image
corrupted by noise and with missing data (black bar); (c) image restored using loopy belief
propagation; (d) image restored using expansion move graph cuts. Images are from
http://vision.middlebury.edu/MRF/results/ (Szeliski, Zabih, Scharstein et al. 2008).

When ρ_p is a quadratic function, the resulting Markov random field is called a Gaussian
Markov random field (GMRF) and its minimum can be found by sparse linear system solving 
(3.103). When the weighting functions are uniform, the GMRF becomes a special case of 
Wiener filtering (Section 3.4.3). Allowing the weighting functions to depend on the input 
image (a special kind of conditional random field, which we describe below) enables quite 
sophisticated image processing algorithms to be performed, including colorization (Levin, 
Lischinski, and Weiss 2004), interactive tone mapping (Lischinski, Farbman, Uyttendaele et 
al. 2006a), natural image matting (Levin, Lischinski, and Weiss 2008), and image restoration 
(Tappen, Liu, Freeman et al. 2007). 

24 Note that, unlike a quadratic penalty, the sum of the horizontal and vertical derivative p-norms is not rotationally
invariant. A better approach may be to locally estimate the gradient direction and to impose different norms on the
perpendicular and parallel components, which Roth and Black (2007b) call a steerable random field.

(a) initial labeling (b) standard move (c) α-β-swap (d) α-expansion

Figure 3.58 Multi-level graph optimization from (Boykov, Veksler, and Zabih 2001) ©
2001 IEEE: (a) initial problem configuration; (b) the standard move only changes one pixel;
(c) the α-β-swap optimally exchanges all α and β-labeled pixels; (d) the α-expansion move
optimally selects among current pixel values and the α label.


When ρ_d or ρ_p are non-quadratic functions, gradient descent techniques such as non-linear
least squares or iteratively re-weighted least squares can sometimes be used (Appendix A.3).
However, if the search space has lots of local minima, as is the case for stereo
matching (Barnard 1989; Boykov, Veksler, and Zabih 2001), more sophisticated techniques 
are required. 

The extension of graph cut techniques to multi-valued problems was first proposed by 
Boykov, Veksler, and Zabih (2001). In their paper, they develop two different algorithms, 
called the swap move and the expansion move, which iterate among a series of binary labeling 
sub-problems to find a good solution (Figure 3.58). Note that a global solution is generally not 
achievable, as the problem is provably NP-hard for general energy functions. Because both 
these algorithms use a binary MRF optimization inside their inner loop, they are subject to the 
kind of constraints on the energy functions that occur in the binary labeling case (Kolmogorov 
and Zabih 2004). Appendix B.5.4 discusses these algorithms in more detail, along with some 
more recently developed approaches to this problem. 
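
To make the expansion move concrete, here is a minimal sketch on a 1D chain of pixels with a Potts smoothness term. Because the example is a chain, each binary keep-or-switch sub-problem is solved exactly by dynamic programming rather than by the graph cut used by Boykov, Veksler, and Zabih (2001). The function names, the Potts weight `lam`, and the unary cost table `data_cost` are assumptions made for illustration only.

```python
import numpy as np

def energy(labels, data_cost, lam):
    """Data cost plus a Potts smoothness penalty on a 1D chain."""
    idx = np.arange(len(labels))
    jumps = np.count_nonzero(labels[1:] != labels[:-1])
    return data_cost[idx, labels].sum() + lam * jumps

def expansion_move(labels, alpha, data_cost, lam):
    """Optimal binary move: every pixel keeps its label or switches to alpha."""
    n = len(labels)
    # cand[i, 0] = current label of pixel i, cand[i, 1] = alpha
    cand = np.stack([labels, np.full(n, alpha)], axis=1)
    cost = np.zeros((n, 2))
    back = np.zeros((n, 2), dtype=int)
    cost[0] = data_cost[0, cand[0]]
    for i in range(1, n):
        for b in (0, 1):
            trans = cost[i - 1] + lam * (cand[i - 1] != cand[i, b])
            back[i, b] = np.argmin(trans)
            cost[i, b] = trans[back[i, b]] + data_cost[i, cand[i, b]]
    # Trace the best binary assignment back from the last pixel.
    b = int(np.argmin(cost[-1]))
    new = labels.copy()
    for i in range(n - 1, -1, -1):
        new[i] = cand[i, b]
        b = back[i, b]
    return new

def alpha_expansion(data_cost, lam, max_sweeps=10):
    """Cycle over labels, accepting expansion moves that lower the energy."""
    labels = data_cost.argmin(axis=1)
    for _ in range(max_sweeps):
        changed = False
        for alpha in range(data_cost.shape[1]):
            new = expansion_move(labels, alpha, data_cost, lam)
            if energy(new, data_cost, lam) < energy(labels, data_cost, lam):
                labels, changed = new, True
        if not changed:
            break
    return labels

# Noisy piecewise-constant observations with three candidate labels.
obs = np.array([0, 0, 1, 0, 0, 2, 2, 2])
data_cost = (obs[:, None] - np.arange(3)[None, :]) ** 2
result = alpha_expansion(data_cost, lam=1.0)
```

On a 2D grid the inner dynamic program would be replaced by a binary graph cut, which is where the constraints on the energy function mentioned above come into play.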

Another MRF inference technique is belief propagation (BP). While belief propagation 
was originally developed for inference over trees, where it is exact (Pearl 1988), it has more 
recently been applied to graphs with loops such as Markov random fields (Freeman, Pasz- 
tor, and Carmichael 2000; Yedidia, Freeman, and Weiss 2001). In fact, some of the better 
performing stereo-matching algorithms use loopy belief propagation (LBP) to perform their 
inference (Sun, Zheng, and Shum 2003). LBP is discussed in more detail in Appendix B.5.3 
as well as the comparative survey paper on MRF optimization (Szeliski, Zabih, Scharstein et 
al. 2008). 
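
The tree case, where belief propagation is exact, can be sketched with min-sum message passing on a chain. This is only an illustrative sketch: the function name and the cost arrays are hypothetical, the pairwise cost is assumed to be the same for every edge, and the loopy variant used on image grids would iterate similar message updates over a graph with cycles without any guarantee of exactness.

```python
import numpy as np

def min_sum_bp_chain(data_cost, pair_cost):
    """Min-sum belief propagation on a chain (a tree, so the result is exact).

    data_cost: (n, k) unary costs; pair_cost: (k, k) cost between neighbors,
    indexed as pair_cost[label_left, label_right].
    Assumes the minimum-energy labeling is unique.
    """
    n, k = data_cost.shape
    fwd = np.zeros((n, k))  # message arriving at node i from node i - 1
    bwd = np.zeros((n, k))  # message arriving at node i from node i + 1
    for i in range(1, n):
        sender = data_cost[i - 1] + fwd[i - 1]
        fwd[i] = np.min(sender[:, None] + pair_cost, axis=0)
    for i in range(n - 2, -1, -1):
        sender = data_cost[i + 1] + bwd[i + 1]
        bwd[i] = np.min(sender[None, :] + pair_cost, axis=1)
    # Min-marginal energy of every label at every node.
    beliefs = data_cost + fwd + bwd
    return beliefs.argmin(axis=1)

rng = np.random.default_rng(0)
D = rng.random((5, 3))           # random unary costs
P = 0.3 * (1.0 - np.eye(3))      # Potts pairwise cost
labels = min_sum_bp_chain(D, P)
```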

Figure 3.57 shows an example of image denoising and inpainting (hole filling) using a 
non-quadratic energy function (non-Gaussian MRF). The original image has been corrupted 
by noise and a portion of the data has been removed (the black bar). In this case, the loopy
belief propagation algorithm computes a slightly lower energy and also a smoother image
than the alpha-expansion graph cut algorithm.

Figure 3.59 Graphical model for a Markov random field with a more complex measurement
model. The additional colored edges show how combinations of unknown values (say, in a
sharp image) produce the measured values (a noisy blurred image). The resulting graphical
model is still a classic MRF and is just as easy to sample from, but some inference algorithms
(e.g., those based on graph cuts) may not be applicable because of the increased network
complexity, since state changes during the inference become more entangled and the posterior
MRF has much larger cliques.

Of course, the above formula (3.113) for the smoothness term $E_p(i,j)$ just shows the
simplest case. In more recent work, Roth and Black (2009) propose a Field of Experts (FoE) 
model, which sums up a large number of exponentiated local filter outputs to arrive at the 
smoothness penalty. Weiss and Freeman (2007) analyze this approach and compare it to the 
simpler hyper-Laplacian model of natural image statistics. Lyu and Simoncelli (2009) use 
Gaussian Scale Mixtures (GSMs) to construct an inhomogeneous multi-scale MRF, with one 
(positive exponential) GMRF modulating the variance (amplitude) of another Gaussian MRF. 

It is also possible to extend the measurement model to make the sampled (noise-corrupted) 
input pixels correspond to blends of unknown (latent) image pixels, as in Figure 3.59. This is 
the commonly occurring case when trying to de-blur an image. While this kind of a model is 
still a traditional generative Markov random field, finding an optimal solution can be difficult 
because the clique sizes get larger. In such situations, gradient descent techniques, such 
as iteratively reweighted least squares, can be used (Joshi, Zitnick, Szeliski et al. 2009). 
Exercise 3.31 has you explore some of these issues.

Figure 3.60 An unordered label MRF (Agarwala, Dontcheva, Agrawala et al. 2004) ©
2004 ACM: Strokes in each of the source images on the left are used as constraints on an 
MRF optimization, which is solved using graph cuts. The resulting multi-valued label field is 
shown as a color overlay in the middle image, and the final composite is shown on the right. 

Unordered labels 

Another case with multi-valued labels where Markov random fields are often applied is
unordered labels, i.e., labels where there is no semantic meaning to the numerical difference
between the values of two labels. For example, if we are classifying terrain from aerial
imagery, it makes no sense to take the numeric difference between the labels assigned to
forest, field, water, and pavement. In fact, the adjacencies of these various kinds of terrain
each have different likelihoods, so it makes more sense to use a prior of the form

$$E_p(i,j) = s_x(i,j)\,V(l(i,j), l(i+1,j)) + s_y(i,j)\,V(l(i,j), l(i,j+1)), \tag{3.115}$$

where $V(l_0, l_1)$ is a general compatibility or potential function. (Note that we have also
replaced $f(i,j)$ with $l(i,j)$ to make it clearer that these are labels rather than discrete function
samples.) An alternative way to write this prior energy (Boykov, Veksler, and Zabih 2001;
Szeliski, Zabih, Scharstein et al. 2008) is

$$E_p = \sum_{(p,q) \in \mathcal{N}} V_{p,q}(l_p, l_q), \tag{3.116}$$

where the $(p, q)$ are neighboring pixels and a spatially varying potential function $V_{p,q}$ is
evaluated for each neighboring pair.
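
As a small illustration of how such a prior energy is evaluated, the sketch below sums a compatibility table over all horizontally and vertically adjacent pairs of a label image. It assumes, for simplicity, a single table `V` shared by every pair (the text allows a spatially varying $V_{p,q}$) and a 4-connected neighborhood; the terrain labels and the table entries are made-up values.

```python
import numpy as np

def prior_energy(labels, V):
    """Sum V[l_p, l_q] over all 4-connected neighboring pixel pairs."""
    horiz = V[labels[:, :-1], labels[:, 1:]].sum()
    vert = V[labels[:-1, :], labels[1:, :]].sum()
    return horiz + vert

# Hypothetical terrain classes: 0 = forest, 1 = field, 2 = water, 3 = pavement.
# Entries are illustrative costs, e.g. forest next to field is cheaper than
# water next to pavement; the diagonal is zero (identical labels cost nothing).
V = np.array([[0.0, 0.5, 1.0, 2.0],
              [0.5, 0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0, 2.0],
              [2.0, 1.0, 2.0, 0.0]])

labels = np.array([[0, 0, 1],
                   [0, 2, 1],
                   [2, 2, 3]])
E = prior_energy(labels, V)
```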

An important application of unordered MRF labeling is seam finding in image composit- 
ing (Davis 1998; Agarwala, Dontcheva, Agrawala et al. 2004) (see Figure 3.60, which is 
explained in more detail in Section 9.3.2). Here, the compatibility $V_{p,q}(l_p, l_q)$ measures the
quality of the visual appearance that would result from placing a pixel $p$ from image $l_p$ next
to a pixel $q$ from image $l_q$. As with most MRFs, we assume that $V_{p,q}(l, l) = 0$, i.e., it is
perfectly fine to choose contiguous pixels from the same image. For different labels, however,
the compatibility $V_{p,q}(l_p, l_q)$ may depend on the values of the underlying pixels $I_{l_p}(p)$ and
$I_{l_q}(q)$.

Figure 3.61 Image segmentation (Boykov and Funka-Lea 2006) © 2006 Springer: The user
draws a few red strokes in the foreground object and a few blue ones in the background. The
system computes color distributions for the foreground and background and solves a binary
MRF. The smoothness weights are modulated by the intensity gradients (edges), which makes
this a conditional random field (CRF).

Consider, for example, where one image $I_0$ is all sky blue, i.e., $I_0(p) = I_0(q) = B$, while
the other image $I_1$ has a transition from sky blue, $I_1(p) = B$, to forest green, $I_1(q) = G$.

          p   q
    I_0:  B   B
    I_1:  B   G

In this case, $V_{p,q}(1, 0) = 0$ (the colors agree), while $V_{p,q}(0, 1) > 0$ (the colors disagree).

Conditional random fields 

In a classic Bayesian model (3.106-3.108), 

$$p(x|y) \propto p(y|x)\,p(x), \tag{3.117}$$

the prior distribution p(x) is independent of the observations y. Sometimes, however, it is 
useful to modify our prior assumptions, say about the smoothness of the field we are trying 
to estimate, in response to the sensed data. Whether this makes sense from a probability 
viewpoint is something we discuss once we have explained the new model. 

Consider the interactive image segmentation problem shown in Figure 3.61 (Boykov and 
Funka-Lea 2006). In this application, the user draws foreground (red) and background (blue) 
strokes, and the system then solves a binary MRF labeling problem to estimate the extent of 
the foreground object. In addition to minimizing a data term, which measures the pointwise 
similarity between pixel colors and the inferred region distributions (Section 5.5), the MRF
is modified so that the smoothness terms $s_x(x,y)$ and $s_y(x,y)$ in Figure 3.56 and (3.113)
depend on the magnitude of the gradient between adjacent pixels. 25

Figure 3.62 Graphical model for a conditional random field (CRF). The additional green
edges show how combinations of sensed data influence the smoothness in the underlying
MRF prior model, i.e., $s_x(i,j)$ and $s_y(i,j)$ in (3.113) depend on adjacent $d(i,j)$ values.
These additional links (factors) enable the smoothness to depend on the input data. However,
they make sampling from this MRF more complex.
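
One way to picture smoothness weights that depend on the gradient between adjacent pixels is the sketch below. The exponential fall-off and the automatic choice of `beta` from the mean squared difference are assumptions borrowed from common practice in interactive segmentation, not something this text specifies; the function and parameter names are hypothetical.

```python
import numpy as np

def gradient_modulated_weights(image, beta=None, lam=1.0):
    """Return (s_x, s_y): smoothness weights that drop across strong edges.

    s_x[i, j] couples pixel (i, j) with its right neighbor, s_y[i, j] with the
    pixel below it. Assumed form: lam * exp(-beta * difference**2).
    """
    image = np.asarray(image, dtype=float)
    dx = image[:, 1:] - image[:, :-1]
    dy = image[1:, :] - image[:-1, :]
    if beta is None:
        # Scale beta by the average squared neighbor difference (an assumption).
        mean_sq = (np.sum(dx ** 2) + np.sum(dy ** 2)) / (dx.size + dy.size)
        beta = 1.0 / (2.0 * mean_sq + 1e-12)
    return lam * np.exp(-beta * dx ** 2), lam * np.exp(-beta * dy ** 2)

# A vertical step edge: smoothing is discouraged only across the edge.
step = np.array([[0.0, 0.0, 1.0, 1.0]] * 3)
s_x, s_y = gradient_modulated_weights(step)
```

With weights like these, a label discontinuity is cheap where the image itself has an edge and expensive in flat regions.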

Since the smoothness term now depends on the data, Bayes’ Rule (3.117) no longer
applies. Instead, we use a direct model for the posterior distribution $p(x|y)$, whose negative log
likelihood can be written as

$$E(x|y) = E_d(x,y) + E_s(x,y) = \sum_p V_p(x_p, y) + \sum_{(p,q) \in \mathcal{N}} V_{p,q}(x_p, x_q, y), \tag{3.118}$$

using the notation introduced in (3.116). The resulting probability distribution is called a
conditional random field (CRF) and was first introduced to the computer vision field by
Kumar and Hebert (2003), based on earlier work in text modeling by Lafferty, McCallum, and
Pereira (2001).

Figure 3.62 shows a graphical model where the smoothness terms depend on the data
values. In this particular model, each smoothness term depends only on its adjacent pair of
data values, i.e., terms are of the form $V_{p,q}(x_p, x_q, y_p, y_q)$ in (3.118).

The idea of modifying smoothness terms in response to input data is not new. For
example, Boykov and Jolly (2001) used this idea for interactive segmentation, as shown in
Figure 3.61, and it is now widely used in image segmentation (Section 5.5) (Blake, Rother,
Brown et al. 2004; Rother, Kolmogorov, and Blake 2004), denoising (Tappen, Liu, Freeman
et al. 2007), and object recognition (Section 14.4.3) (Winn and Shotton 2006; Shotton, Winn,
Rother et al. 2009).

25 An alternative formulation that also uses detected edges to modulate the smoothness of a depth or motion field
and hence to integrate multiple lower level vision modules is presented by Poggio, Gamble, and Little (1988).

Figure 3.63 Graphical model for a discriminative random field (DRF). The additional green
edges show how combinations of sensed data, e.g., $d(i,j+1)$, influence the data term for
$f(i,j)$. The generative model is therefore more complex, i.e., we cannot just apply a simple
function to the unknown variables and add noise.

In stereo matching, the idea of encouraging disparity discontinuities to coincide with 
intensity edges goes back even further to the early days of optimization and MRF-based 
algorithms (Poggio, Gamble, and Little 1988; Fua 1993; Bobick and Intille 1999; Boykov, 
Veksler, and Zabih 2001) and is discussed in more detail in Section 11.5.

In addition to using smoothness terms that adapt to the input data, Kumar and Hebert 
(2003) also compute a neighborhood function over the input data for each $V_p(x_p, y)$ term,
as illustrated in Figure 3.63, instead of using the classic unary MRF data term $V_p(x_p, y_p)$
shown in Figure 3.56. 26 Because such neighborhood functions can be thought of as
discriminant functions (a term widely used in machine learning (Bishop 2006)), they call the
resulting graphical model a discriminative random field (DRF). In their paper, Kumar and 
Hebert (2006) show that DRFs outperform similar CRFs on a number of applications, such 
as structure detection (Figure 3.64) and binary image denoising. 

Here again, one could argue that previous stereo correspondence algorithms also look at 
a neighborhood of input data, either explicitly, because they compute correlation measures 
(Criminisi, Cross, Blake et al. 2006) as data terms, or implicitly, because even pixel-wise 
disparity costs look at several pixels in either the left or right image (Barnard 1989; Boykov, 
Veksler, and Zabih 2001). 

26 Kumar and Hebert (2006) call the unary potentials $V_p(x_p, y)$ association potentials and the pairwise potentials
$V_{p,q}(x_p, x_q, y)$ interaction potentials.

Figure 3.64 Structure detection results using an MRF (left) and a DRF (right) (Kumar and
Hebert 2006) © 2006 Springer. 


What, then, are the advantages and disadvantages of using conditional or discriminative
random fields instead of MRFs? 

Classic Bayesian inference (MRF) assumes that the prior distribution of the data is in- 
dependent of the measurements. This makes a lot of sense: if you see a pair of sixes when 
you first throw a pair of dice, it would be unwise to assume that they will always show up 
thereafter. However, if after playing for a long time you detect a statistically significant bias, 
you may want to adjust your prior. What CRFs do, in essence, is to select or modify the prior 
model based on observed data. This can be viewed as making a partial inference over addi- 
tional hidden variables or correlations between the unknowns (say, a label, depth, or clean 
image) and the knowns (observed images). 

In some cases, the CRF approach makes a lot of sense and is, in fact, the only plausi- 
ble way to proceed. For example, in grayscale image colorization (Section 10.3.2) (Levin, 
Lischinski, and Weiss 2004), the best way to transfer the continuity information from the 
input grayscale image to the unknown color image is to modify local smoothness constraints. 
Similarly, for simultaneous segmentation and recognition (Winn and Shotton 2006; Shotton, 
Winn, Rother et al. 2009), it makes a lot of sense to permit strong color edges to influence 
the semantic image label continuities. 

In other cases, such as image denoising, the situation is more subtle. Using a non- 
quadratic (robust) smoothness term as in (3.113) plays a qualitatively similar role to setting 
the smoothness based on local gradient information in a Gaussian MRF (GMRF) (Tappen, 
Liu, Freeman et al. 2007). (In more recent work, Tanaka and Okutomi (2008) use a larger 
neighborhood and full covariance matrix on a related Gaussian MRF.) The advantage of Gaus- 
sian MRFs, when the smoothness can be correctly inferred, is that the resulting quadratic 
energy can be minimized in a single step. However, for situations where the discontinuities 
are not self-evident in the input data, such as for piecewise-smooth sparse data interpolation 
(Blake and Zisserman 1987; Terzopoulos 1988), classic robust smoothness energy
minimization may be preferable. Thus, as with most computer vision algorithms, a careful analysis of
the problem at hand and desired robustness and computation constraints may be required to 
choose the best technique. 

Perhaps the biggest advantage of CRFs and DRFs, as argued by Kumar and Hebert (2006), 
Tappen, Liu, Freeman et al. (2007), and Blake, Rother, Brown et al. (2004), is that learning the
model parameters is sometimes easier. While learning parameters in MRFs and their variants 
is not a topic that we cover in this book, interested readers can find more details in recently 
published articles (Kumar and Hebert 2006; Roth and Black 2007a; Tappen, Liu, Freeman et 
al. 2007; Tappen 2007; Li and Huttenlocher 2008). 

3.7.3 Application: Image restoration

In Section 3.4.4, we saw how two-dimensional linear and non-linear filters can be used to 
remove noise or enhance sharpness in images. Sometimes, however, images are degraded by 
larger problems, such as scratches and blotches (Kokaram 2004). In this case, Bayesian meth- 
ods such as MRFs, which can model spatially varying per-pixel measurement noise, can be 
used instead. An alternative is to use hole filling or inpainting techniques (Bertalmio, Sapiro, 
Caselles et al. 2000; Bertalmio, Vese, Sapiro et al. 2003; Criminisi, Perez, and Toyama 2004), 
as discussed in Sections 5.1.4 and 10.5.1. 

Figure 3.57 shows an example of image denoising and inpainting (hole filling) using a 
Markov random field. The original image has been corrupted by noise and a portion of the 
data has been removed. In this case, the loopy belief propagation algorithm computes a 
slightly lower energy and also a smoother image than the alpha-expansion graph cut algo- 
rithm. 


3.8 Additional reading 

If you are interested in exploring the topic of image processing in more depth, some popular 
textbooks have been written by Lim (1990); Crane (1997); Gomes and Velho (1997); Jahne 
(1997); Pratt (2007); Russ (2007); Burger and Burge (2008); Gonzales and Woods (2008). 
The pre-eminent conference and journal in this field are the IEEE Conference on Image
Processing and the IEEE Transactions on Image Processing.

For image compositing operators, the seminal reference is by Porter and Duff (1984) 
while Blinn (1994a, b) provides a more detailed tutorial. For image matting, Smith and
Blinn (1996) were the first to bring this topic to the attention of the graphics community, 
while Wang and Cohen (2007a) provide a recent in-depth survey. 

In the realm of linear filtering, Freeman and Adelson (1991) provide a great introduction
to separable and steerable oriented band-pass filters, while Perona (1995) shows how to
approximate any filter as a sum of separable components.

The literature on non-linear filtering is quite wide and varied; it includes such topics as 
bilateral filtering (Tomasi and Manduchi 1998; Durand and Dorsey 2002; Paris and Durand 
2006; Chen, Paris, and Durand 2007; Paris, Kornprobst, Tumblin et al. 2008), related itera- 
tive algorithms (Saint-Marc, Chen, and Medioni 1991; Nielsen, Florack, and Deriche 1997; 
Black, Sapiro, Marimont et al. 1998; Weickert, ter Haar Romeny, and Viergever 1998; Weick- 
ert 1998; Barash 2002; Scharr, Black, and Haussecker 2003; Barash and Comaniciu 2004), 
and variational approaches (Chan, Osher, and Shen 2001; Tschumperle and Deriche 2005; 
Tschumperle 2006; Kaftory, Schechner, and Zeevi 2007). 

Good references to image morphology include (Haralick and Shapiro 1992, Section 5.2; 
Bovik 2000, Section 2.2; Ritter and Wilson 2000, Section 7; Serra 1982; Serra and Vincent 
1992; Yuille, Vincent, and Geiger 1992; Soille 2006). 

The classic papers for image pyramids and pyramid blending are by Burt and Adelson 
(1983a,b). Wavelets were first introduced to the computer vision community by Mallat (1989) 
and good tutorial and review papers and books are available (Strang 1989; Simoncelli and 
Adelson 1990b; Rioul and Vetterli 1991; Chui 1992; Meyer 1993; Sweldens 1997). Wavelets 
are widely used in the computer graphics community to perform multi-resolution geomet- 
ric processing (Stollnitz, DeRose, and Salesin 1996) and have been used in computer vision 
for similar applications (Szeliski 1990b; Pentland 1994; Gortler and Cohen 1995; Yaou and 
Chang 1994; Lai and Vemuri 1997; Szeliski 2006b), as well as for multi-scale oriented filter- 
ing (Simoncelli, Freeman, Adelson et al. 1992) and denoising (Portilla, Strela, Wainwright et 
al. 2003). 

While image pyramids (Section 3.5.3) are usually constructed using linear filtering
operators, some recent work has started investigating non-linear filters, since these can better
preserve details and other salient features. Some representative papers in the computer vision 
literature are by Gluckman (2006a, b); Lyu and Simoncelli (2008) and in computational pho- 
tography by Bae, Paris, and Durand (2006); Farbman, Fattal, Lischinski et al. (2008); Fattal 
(2009). 

High-quality algorithms for image warping and resampling are covered both in the im- 
age processing literature (Wolberg 1990; Dodgson 1992; Gomes, Darsa, Costa et al. 1999; 
Szeliski, Winder, and Uyttendaele 2010) and in computer graphics (Williams 1983; Heckbert 
1986; Barkans 1997; Akenine-Moller and Haines 2002), where they go under the name of 
texture mapping. Combinations of image warping and image blending techniques are used to
enable morphing between images, which is covered in a series of seminal papers and books 
(Beier and Neely 1992; Gomes, Darsa, Costa et al. 1999). 

The regularization approach to computer vision problems was first introduced to the vi- 
sion community by Poggio, Torre, and Koch (1985) and Terzopoulos (1986a, b, 1988) and 
continues to be a popular framework for formulating and solving low-level vision problems
(Ju, Black, and Jepson 1996; Nielsen, Florack, and Deriche 1997; Nordstrom 1990; Brox,
Bruhn, Papenberg et al. 2004; Levin, Lischinski, and Weiss 2008). More detailed
mathematical treatment and additional applications can be found in the applied mathematics and
statistics literature (Tikhonov and Arsenin 1977; Engl, Hanke, and Neubauer 1996). 

The literature on Markov random fields is truly immense, with publications in related 
fields such as optimization and control theory of which few vision practitioners are even 
aware. A good guide to the latest techniques is the book edited by Blake, Kohli, and Rother 
(2010). Other recent articles that contain nice literature reviews or experimental
comparisons include (Boykov and Funka-Lea 2006; Szeliski, Zabih, Scharstein et al. 2008; Kumar,
Veksler, and Torr 2010). 

The seminal paper on Markov random fields is the work of Geman and Geman (1984), 
who introduced this formalism to computer vision researchers and also introduced the no- 
tion of line processes, additional binary variables that control whether smoothness penalties 
are enforced or not. Black and Rangarajan (1996) showed how independent line processes 
could be replaced with robust pairwise potentials; Boykov, Veksler, and Zabih (2001)
developed iterative binary graph cut algorithms for optimizing multi-label MRFs; Kolmogorov
and Zabih (2004) characterized the class of binary energy potentials required for these tech- 
niques to work; and Freeman, Pasztor, and Carmichael (2000) popularized the use of loopy 
belief propagation for MRF inference. Many more additional references can be found in 
Sections 3.7.2 and 5.5, and Appendix B.5. 

3.9 Exercises 

Ex 3.1: Color balance Write a simple application to change the color balance of an image 
by multiplying each color value by a different user-specified constant. If you want to get 
fancy, you can make this application interactive, with sliders. 

1. Do you get different results if you take out the gamma transformation before or after
doing the multiplication? Why or why not? 

2. Take the same picture with your digital camera using different color balance settings 
(most cameras control the color balance from one of the menus). Can you recover what 
the color balance ratios are between the different settings? You may need to put your 
camera on a tripod and align the images manually or automatically to make this work. 
Alternatively, use a color checker chart (Figure 10.3b), as discussed in Sections 2.3 and
10.1.1.

3. If you have access to the RAW image for the camera, perform the demosaicing yourself 
(Section 10.3.1) or downsample the image resolution to get a “true” RGB image. Does
your camera perform a simple linear mapping between RAW values and the
color-balanced values in a JPEG? Some high-end cameras have a RAW+JPEG mode, which
makes this comparison much easier. 

4. Can you think of any reason why you might want to perform a color twist (Sec- 
tion 3.1.2) on the images? See also Exercise 2.9 for some related ideas. 
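
A possible starting point for this exercise is sketched below. It assumes 8-bit RGB input and a pure power-law gamma of 2.2 as a stand-in for the camera's actual transfer curve; the function name, the `linearize` switch, and the default gamma are illustrative choices rather than anything prescribed here.

```python
import numpy as np

def color_balance(image, gains, gamma=2.2, linearize=True):
    """Multiply each color channel by a user-specified gain.

    If linearize is True, the (assumed power-law) gamma is removed before the
    multiplication and re-applied afterwards; otherwise the gains are applied
    directly to the gamma-encoded values.
    """
    img = np.asarray(image, dtype=float) / 255.0
    gains = np.asarray(gains, dtype=float)
    if linearize:
        out = np.clip(img ** gamma * gains, 0.0, 1.0) ** (1.0 / gamma)
    else:
        out = np.clip(img * gains, 0.0, 1.0)
    return np.round(out * 255.0).astype(np.uint8)

pixel = np.array([[[10, 128, 250]]], dtype=np.uint8)
same = color_balance(pixel, (1.0, 1.0, 1.0))                      # unit gains
direct = color_balance(pixel, (2.0, 2.0, 2.0), linearize=False)   # gamma-encoded
linear = color_balance(pixel, (2.0, 2.0, 2.0), linearize=True)    # linear light
```

Comparing `direct` and `linear` for the same gains is one way to explore the first question above.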

Ex 3.2: Compositing and reflections Section 3.1.3 describes the process of compositing 
an alpha-matted image on top of another. Answer the following questions and optionally 
validate them experimentally: 

1. Most captured images have gamma correction applied to them. Does this invalidate the
basic compositing equation (3.8); if so, how should it be fixed? 

2. The additive (pure reflection) model may have limitations. What happens if the glass is 
tinted, especially to a non-gray hue? How about if the glass is dirty or smudged? How 
could you model wavy glass or other kinds of refractive objects? 

Ex 3.3: Blue screen matting Set up a blue or green background, e.g., by buying a large 
piece of colored posterboard. Take a picture of the empty background, and then of the back- 
ground with a new object in front of it. Pull the matte using the difference between each 
colored pixel and its assumed corresponding background pixel, using one of the techniques 
described in Section 3.1.3 or by Smith and Blinn (1996).

Ex 3.4: Difference keying Implement a difference keying algorithm (see Section 3.1.3) 
(Toyama, Krumm, Brumitt et al. 1999), consisting of the following steps: 

1. Compute the mean and variance (or median and robust variance) at each pixel in an
“empty” video sequence. 

2. For each new frame, classify each pixel as foreground or background (set the back- 
ground pixels to RGBA=0). 

3. (Optional) Compute the alpha channel and composite over a new background. 

4. (Optional) Clean up the image using morphology (Section 3.3.1), label the connected 
components (Section 3.3.4), compute their centroids, and track them from frame to 
frame. Use this to build a “people counter”. 
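
The first two steps above might be sketched as follows. This is a minimal version under simple assumptions: per-pixel, per-channel Gaussian statistics, a fixed threshold of `k` standard deviations, and a variance floor `min_var`; all names and default values are hypothetical.

```python
import numpy as np

def fit_background(frames):
    """Per-pixel, per-channel mean and variance of an 'empty' sequence."""
    stack = np.asarray(frames, dtype=float)
    return stack.mean(axis=0), stack.var(axis=0)

def classify_foreground(frame, mean, var, k=3.0, min_var=1e-6):
    """Foreground if any channel deviates by more than k standard deviations."""
    d2 = (np.asarray(frame, dtype=float) - mean) ** 2 / np.maximum(var, min_var)
    return d2.max(axis=-1) > k ** 2

def key_frame(frame, mask):
    """Return an RGBA image with background pixels set to RGBA = 0."""
    alpha = np.where(mask, 255, 0)
    rgba = np.dstack([frame, alpha]).astype(np.uint8)
    rgba[~mask] = 0
    return rgba

# Synthetic 'empty' sequence flickering around gray level 100.
empty = [np.full((4, 4, 3), v, dtype=float) for v in (99, 101) * 5]
mean, var = fit_background(empty)

frame = np.full((4, 4, 3), 100.0)
frame[1, 1] = (200, 100, 100)            # a new object covers one pixel
mask = classify_foreground(frame, mean, var)
rgba = key_frame(frame, mask)
```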

Ex 3.5: Photo effects Write a variety of photo enhancement or effects filters: contrast, so- 
larization (quantization), etc. Which ones are useful (perform sensible corrections) and which 
ones are more creative (create unusual images)?

Ex 3.6: Histogram equalization Compute the gray level (luminance) histogram for an
image and equalize it so that the tones look better (and the image is less sensitive to exposure
settings). You may want to use the following steps: 

1. Convert the color image to luminance (Section 3.1.2). 

2. Compute the histogram, the cumulative distribution, and the compensation transfer 
function (Section 3.1.4). 

3. (Optional) Try to increase the “punch” in the image by ensuring that a certain fraction 
of pixels (say, 5%) are mapped to pure black and white. 

4. (Optional) Limit the local gain $f'(I)$ in the transfer function. One way to do this is to
limit $f(I) < \gamma I$ or $f'(I) < \gamma$ while performing the accumulation (3.9), keeping any
unaccumulated values “in reserve”. (I’ll let you figure out the exact details.) 

5. Compensate the luminance channel through the lookup table and re-generate the color 
image using color ratios (2.116). 

6. (Optional) Color values that are clipped in the original image, i.e., have one or more
saturated color channels, may appear unnatural when remapped to a non-clipped value. 
Extend your algorithm to handle this case in some useful way. 
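
The non-optional steps (1, 2, and 5) could be sketched as below. The luminance weights (0.299, 0.587, 0.114) and the way color is regenerated, by scaling every channel with the ratio of new to old luminance, are assumptions made for this sketch; the function name is hypothetical and the optional refinements are left out.

```python
import numpy as np

def equalize_luminance(rgb):
    """Histogram-equalize the luminance of an 8-bit RGB image."""
    rgb = np.asarray(rgb, dtype=float)
    # Step 1: luminance (assumed weights).
    lum = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    lum_q = np.clip(np.round(lum), 0, 255).astype(int)
    # Step 2: histogram, cumulative distribution, and lookup table.
    hist = np.bincount(lum_q.ravel(), minlength=256)
    cdf = np.cumsum(hist) / lum_q.size
    lut = 255.0 * cdf
    # Step 5: remap luminance and rescale the channels to keep color ratios.
    new_lum = lut[lum_q]
    ratio = new_lum / np.maximum(lum, 1e-6)
    return np.clip(np.round(rgb * ratio[..., None]), 0, 255).astype(np.uint8)

# A dull two-tone gray image: half the pixels at 50, half at 100.
img = np.zeros((2, 2, 3), dtype=np.uint8)
img[:, 0] = 50
img[:, 1] = 100
out = equalize_luminance(img)
```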

Ex 3.7: Local histogram equalization Compute the gray level (luminance) histograms for 
each patch, but add to vertices based on distance (a spline). 

1. Build on Exercise 3.6 (luminance computation). 

2. Distribute values (counts) to adjacent vertices (bilinear). 

3. Convert to CDF (look-up functions). 

4. (Optional) Use low-pass filtering of CDFs. 

5. Interpolate adjacent CDFs for final lookup. 

Ex 3.8: Padding for neighborhood operations Write down the formulas for computing
the padded pixel values $\tilde{f}(i,j)$ as a function of the original pixel values $f(k,l)$ and the image
width and height $(M, N)$ for each of the padding modes shown in Figure 3.13. For example,
for replication (clamping),

$$\tilde{f}(i,j) = f(k,l), \quad k = \max(0, \min(M-1, i)), \quad l = \max(0, \min(N-1, j)).$$


(Hint: you may want to use the min, max, mod, and absolute value operators in addition to 
the regular arithmetic operators.)

• Describe in more detail the advantages and disadvantages of these various modes.

• (Optional) Check what your graphics card does by drawing a texture-mapped rectangle 
where the texture coordinates lie beyond the [0.0, 1.0] range and using different texture 
clamping modes. 
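
The replication formula given in the exercise translates directly into code, as sketched below together with a cyclic wrap built on the mod operator mentioned in the hint. The function name and the `mode` strings are hypothetical, and the array is assumed to be stored row-major, so the column index `k` runs over the width `M` and the row index `l` over the height `N`. The remaining modes are left for the exercise.

```python
import numpy as np

def padded_value(f, i, j, mode="clamp"):
    """Look up pixel (i, j), possibly outside the image, under a padding mode."""
    N, M = f.shape[:2]              # rows = height N, columns = width M
    if mode == "clamp":             # replication, as in the formula above
        k = max(0, min(M - 1, i))
        l = max(0, min(N - 1, j))
    elif mode == "wrap":            # cyclic repetition via the mod operator
        k, l = i % M, j % N
    else:
        raise ValueError(f"unsupported mode: {mode}")
    return f[l, k]

f = np.arange(12).reshape(3, 4)     # height 3, width 4
corner = padded_value(f, 5, 7)      # clamps to the bottom-right pixel
wrapped = padded_value(f, -1, -1, mode="wrap")
```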

Ex 3.9: Separable filters Implement convolution with a separable kernel. The input should 
be a grayscale or color image along with the horizontal and vertical kernels. Make sure 
you support the padding mechanisms developed in the previous exercise. You will need this 
functionality for some of the later exercises. If you already have access to separable filtering 
in an image processing package you are using (such as IPL), skip this exercise. 

• (Optional) Use Pietro Perona’s (1995) technique to approximate convolution as a sum 
of a number of separable kernels. Let the user specify the number of kernels and report 
back some sensible metric of the approximation fidelity. 
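
A compact way to organize the separable convolution is to run a 1D convolution along each axis in turn, as in the sketch below. It assumes odd-length kernels and uses only replication padding (NumPy's `mode="edge"`); supporting the other padding modes would mean changing the `np.pad` call. The function name is hypothetical.

```python
import numpy as np

def separable_filter(image, kx, ky):
    """Convolve rows with kx, then columns with ky (replication padding).

    Works for grayscale (H, W) and color (H, W, C) arrays; kernels are
    assumed to have odd length.
    """
    img = np.asarray(image, dtype=float)

    def filter_axis(a, k, axis):
        k = np.asarray(k, dtype=float)
        r = len(k) // 2
        pad = [(0, 0)] * a.ndim
        pad[axis] = (r, r)
        padded = np.pad(a, pad, mode="edge")
        conv = lambda v: np.convolve(v, k, mode="valid")
        return np.apply_along_axis(conv, axis, padded)

    return filter_axis(filter_axis(img, kx, axis=1), ky, axis=0)

# The impulse response of the separable filter is the outer product ky * kx.
impulse = np.zeros((5, 5))
impulse[2, 2] = 1.0
kx, ky = [1.0, 2.0, 3.0], [1.0, 0.0, 0.0]
response = separable_filter(impulse, kx, ky)
```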

Ex 3.10: Discrete Gaussian filters Discuss the following issues with implementing a dis- 
crete Gaussian filter: 

• If you just sample the equation of a continuous Gaussian filter at discrete locations, 
will you get the desired properties, e.g., will the coefficients sum up to 1? Similarly, if
you sample a derivative of a Gaussian, do the samples sum up to 0 or have vanishing 
higher-order moments? 

• Would it be preferable to take the original signal, interpolate it with a sinc, blur with a
continuous Gaussian, then pre-filter with a sinc before re-sampling? Is there a simpler
way to do this in the frequency domain? 

• Would it make more sense to produce a Gaussian frequency response in the Fourier 
domain and to then take an inverse FFT to obtain a discrete filter? 

• How does truncation of the filter change its frequency response? Does it introduce any 
additional artifacts? 

• Are the resulting two-dimensional filters as rotationally invariant as their continuous 
analogs? Is there some way to improve this? In fact, can any 2D discrete (separable or 
non-separable) filter be truly rotationally invariant? 
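
A quick numerical experiment can inform the first question. The sketch below samples a unit-area continuous Gaussian and its first derivative at integer positions; the function name and the particular values of `sigma` and `radius` are arbitrary choices for illustration.

```python
import numpy as np

def sampled_gaussian(sigma, radius):
    """Sample a continuous unit-area Gaussian and its derivative at integers."""
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)
    dg = -x / sigma ** 2 * g
    return x, g, dg

for sigma, radius in [(0.5, 4), (2.0, 10)]:
    x, g, dg = sampled_gaussian(sigma, radius)
    print(f"sigma={sigma}: sum(g)={g.sum():.6f}  sum(dg)={dg.sum():.2e}  "
          f"sum(x*dg)={np.sum(x * dg):.6f}")
```

The printed sums show how far the sampled coefficients are from the continuous values (unit area for the Gaussian, zero sum and a first moment of -1 for its derivative) at each width.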

Ex 3.11: Sharpening, blur, and noise removal Implement some softening, sharpening, and 
non-linear diffusion (selective sharpening or noise removal) filters, such as Gaussian, median, 
and bilateral (Section 3.3.1), as discussed in Section 3.4.4. 

Take blurry or noisy images (shooting in low light is a good way to get both) and try to 
improve their appearance and legibility.

Ex 3.12: Steerable filters Implement Freeman and Adelson’s (1991) steerable filter
algorithm. The input should be a grayscale or color image and the output should be a multi-banded
image consisting of $G_1^{0^\circ}$ and $G_1^{90^\circ}$. The coefficients for the filters can be found in the paper
by Freeman and Adelson (1991). 

Test the various order filters on a number of images of your choice and see if you can 
reliably find corner and intersection features. These filters will be quite useful later to detect 
elongated structures, such as lines (Section 4.3). 

Ex 3.13: Distance transform Implement some (raster-scan) algorithms for city block and 
Euclidean distance transforms. Can you do it without peeking at the literature (Danielsson 
1980; Borgefors 1986)? If so, what problems did you come across and resolve? 

Later on, you can use the distance functions you compute to perform feathering during 
image stitching (Section 9.3.2). 
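
For the city block case, one possible two-pass raster-scan formulation is sketched below, which may be a useful reference point after attempting the exercise. It measures the distance from each pixel to the nearest background (zero) pixel and assumes the mask contains at least one background pixel; the function name is hypothetical, and the Euclidean variant is not covered.

```python
import numpy as np

def city_block_distance(mask):
    """Distance to the nearest zero pixel using two raster-scan passes."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    d = np.where(mask, h + w, 0)   # h + w exceeds any possible distance
    # Forward pass: propagate from the top and left neighbors.
    for y in range(h):
        for x in range(w):
            if y > 0:
                d[y, x] = min(d[y, x], d[y - 1, x] + 1)
            if x > 0:
                d[y, x] = min(d[y, x], d[y, x - 1] + 1)
    # Backward pass: propagate from the bottom and right neighbors.
    for y in range(h - 1, -1, -1):
        for x in range(w - 1, -1, -1):
            if y < h - 1:
                d[y, x] = min(d[y, x], d[y + 1, x] + 1)
            if x < w - 1:
                d[y, x] = min(d[y, x], d[y, x + 1] + 1)
    return d

mask = np.ones((5, 5), dtype=bool)
mask[2, 2] = False                 # a single background pixel in the center
dist = city_block_distance(mask)
```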

Ex 3.14: Connected components Implement one of the connected component algorithms 
from Section 3.3.4 or Section 2.3 from Haralick and Shapiro’s book (1992) and discuss its 
computational complexity. 

• Threshold or quantize an image to obtain a variety of input labels and then compute the 
area statistics for the regions that you find. 

• Use the connected components that you have found to track or match regions in differ- 
ent images or video frames. 
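
As one illustration, the sketch below labels 4-connected regions of equal value with a breadth-first flood fill and records the area of each region. This is a simple alternative chosen for brevity and is not necessarily one of the algorithms in the cited sections; the function name is hypothetical. Each pixel is enqueued once, so the running time is linear in the number of pixels.

```python
from collections import deque
import numpy as np

def connected_components(img):
    """Label 4-connected regions of equal value; returns (labels, areas)."""
    img = np.asarray(img)
    h, w = img.shape
    labels = np.full((h, w), -1, dtype=int)
    areas = []
    for sy in range(h):
        for sx in range(w):
            if labels[sy, sx] >= 0:
                continue                      # already part of a region
            cur = len(areas)
            labels[sy, sx] = cur
            queue, area = deque([(sy, sx)]), 0
            while queue:
                y, x = queue.popleft()
                area += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w and labels[ny, nx] < 0
                            and img[ny, nx] == img[sy, sx]):
                        labels[ny, nx] = cur
                        queue.append((ny, nx))
            areas.append(area)
    return labels, areas

binary = np.array([[1, 1, 0],
                   [0, 1, 0],
                   [0, 0, 0]])
labels, areas = connected_components(binary)
```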

Ex 3.15: Fourier transform Prove the properties of the Fourier transform listed in Ta- 
ble 3.1 and derive the formulas for the Fourier transforms listed in Tables 3.2 and 3.3. These 
exercises are very useful if you want to become comfortable working with Fourier transforms, 
which is a very useful skill when analyzing and designing the behavior and efficiency of many 
computer vision algorithms. 

Ex 3.16: Wiener filtering Estimate the frequency spectrum of your personal photo collec- 
tion and use it to perform Wiener filtering on a few images with varying degrees of noise. 

1. Collect a few hundred of your images by re-scaling them to fit within a 512 × 512
window and cropping them. 

2. Take their Fourier transforms, throw away the phase information, and average together 
all of the spectra. 

3. Pick two of your favorite images and add varying amounts of Gaussian noise,
$\sigma_n \in \{1, 2, 5, 10, 20\}$ gray levels.

4. For each combination of image and noise, determine by eye which width of a Gaussian
blurring filter $\sigma_s$ gives the best denoised result. You will have to make a subjective
decision between sharpness and noise. 

5. Compute the Wiener filtered version of all the noised images and compare them against 
your hand-tuned Gaussian-smoothed images. 

6. (Optional) Do your image spectra have a lot of energy concentrated along the horizontal
and vertical axes ($f_x = 0$ and $f_y = 0$)? Can you think of an explanation for this? Does
rotating your image samples by 45° move this energy to the diagonals? If not, could it
be due to edge effects in the Fourier transform? Can you suggest some techniques for
reducing such effects?
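
Steps 2 and 5 can be prototyped in a few lines. The sketch below assumes the standard no-blur Wiener gain $W = P_s / (P_s + \sigma_n^2)$, same-sized grayscale images, and NumPy's unnormalized FFT convention; the function names `average_power_spectrum` and `wiener_denoise` are hypothetical helpers introduced for illustration.

```python
import numpy as np

def average_power_spectrum(images):
    """Average |FFT|^2 over same-sized grayscale images, discarding phase."""
    return np.mean([np.abs(np.fft.fft2(im)) ** 2 for im in images], axis=0)

def wiener_denoise(noisy, signal_power, noise_sigma):
    """Apply the gain W = P_s / (P_s + P_n) in the Fourier domain.

    noisy:        2D float array (grayscale image)
    signal_power: 2D array of the same shape, laid out like np.fft.fft2 output
    noise_sigma:  standard deviation of the additive white Gaussian noise
    """
    # With NumPy's unnormalized FFT, white noise of standard deviation sigma
    # has a flat expected power of N * sigma^2, where N is the pixel count.
    noise_power = noisy.size * noise_sigma ** 2
    # The small constant avoids 0/0 where both power terms vanish.
    gain = signal_power / (signal_power + noise_power + 1e-12)
    return np.real(np.fft.ifft2(gain * np.fft.fft2(noisy)))
```

A typical call would be `wiener_denoise(noisy, average_power_spectrum(collection), sigma_n)`, after which the result can be compared against the hand-tuned Gaussian blur from step 4.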

Ex 3.17: Deblurring using Wiener filtering Use Wiener filtering to deblur some images. 

1. Modify the Wiener filter derivation (3.66-3.74) to incorporate blur (3.75). 

2. Discuss the resulting Wiener filter in terms of its noise suppression and frequency 
boosting characteristics. 

3. Assuming that the blur kernel is Gaussian and the image spectrum follows an inverse 
frequency law, compute the frequency response of the Wiener filter, and compare it to 
the unsharp mask. 

4. Synthetically blur two of your sample images with Gaussian blur kernels of different 
radii, add noise, and then perform Wiener filtering. 

5. Repeat the above experiment with a “pillbox” (disc) blurring kernel, which is charac- 
teristic of a finite aperture lens (Section 2.2.3). Compare these results to Gaussian blur 
kernels (be sure to inspect your frequency plots). 

6. It has been suggested that regular apertures are anathema to de-blurring because they
introduce zeros in the sensed frequency spectrum (Veeraraghavan, Raskar, Agrawal et
al. 2007). Show that this is indeed an issue if no prior model is assumed for the signal,
i.e., $P_s^{-1} \ll 1$. If a reasonable power spectrum is assumed, is this still a problem (do we
still get banding or ringing artifacts)?

Ex 3.18: High-quality image resampling Implement several of the low-pass filters presented
in Section 3.5.2, as well as the windowed sinc shown in Table 3.2 and
Figure 3.29. Feel free to implement other filters (Wolberg 1990; Unser 1999).

Apply your filters to continuously resize an image, both magnifying (interpolating) and
minifying (decimating) it; compare the resulting animations for several filters. Use both a
synthetic chirp image (Figure 3.65a) and natural images with lots of high-frequency detail
(Figure 3.65b-c).²⁷

Computer Vision: Algorithms and Applications (September 3, 2010 draft)

Figure 3.65 Sample images for testing the quality of resampling algorithms: (a) a synthetic
chirp; (b) and (c) some high-frequency images from the image compression community.

You may find it helpful to write a simple visualization program that continuously plays the
animations for two or more filters at once and that lets you “blink” between different results.

Discuss the merits and deficiencies of each filter, as well as its tradeoff between speed and 
quality. 

Ex 3.19: Pyramids Construct an image pyramid. The inputs should be a grayscale or color 
image, a separable filter kernel, and the number of desired levels. Implement at least the 
following kernels: 

• 2 × 2 block filtering;

• Burt and Adelson’s binomial kernel $\frac{1}{16}(1, 4, 6, 4, 1)$ (Burt and Adelson 1983a);

• a high-quality seven- or nine-tap filter. 

Compare the visual quality of the various decimation filters. Also, shift your input image by 
1 to 4 pixels and compare the resulting decimated (quarter size) image sequence. 
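
One possible skeleton for the pyramid construction is sketched below. It assumes a grayscale image, replicated borders, and decimation by keeping every second sample; the names `BINOMIAL`, `blur_separable`, and `build_pyramid` are illustrative choices, not part of any library.

```python
import numpy as np

# Burt and Adelson's binomial kernel, 1/16 (1, 4, 6, 4, 1).
BINOMIAL = np.array([1, 4, 6, 4, 1], dtype=float) / 16.0

def blur_separable(image, kernel):
    """Convolve rows, then columns, with a 1D kernel (replicated borders)."""
    pad = len(kernel) // 2
    padded = np.pad(image, pad, mode="edge")
    # 'valid' convolution on the padded array restores the original size.
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")

def build_pyramid(image, kernel=BINOMIAL, levels=3):
    """Return [level 0, level 1, ...], halving the resolution at each level."""
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels - 1):
        pyramid.append(blur_separable(pyramid[-1], kernel)[::2, ::2])
    return pyramid
```

Swapping in a different separable kernel only requires passing another odd-length 1D array whose entries sum to one; a color image could be handled by applying the same code to each band.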

Ex 3.20: Pyramid blending Write a program that takes as input two color images and a 
binary mask image and produces the Laplacian pyramid blend of the two images. 

1. Construct the Laplacian pyramid for each image.

2. Construct the Gaussian pyramid for the two mask images (the input image and its
complement).

3. Multiply each Laplacian image by its corresponding mask and sum the images (see
Figure 3.43).

4. Reconstruct the final image from the blended Laplacian pyramid.

²⁷ These particular images are available on the book’s Web site.

Generalize your algorithm to input n images and a label image with values 1 . . . n (the value 
0 can be reserved for “no input”). Discuss whether the weighted summation stage (step 3) 
needs to keep track of the total weight for renormalization, or whether the math just works 
out. Use your algorithm either to blend two differently exposed images (to avoid under- and
over-exposed regions) or to make a creative blend of two different scenes.

Ex 3.21: Wavelet construction and applications Implement one of the wavelet families 
described in Section 3.5.4 or by Simoncelli and Adelson (1990b), as well as the basic Lapla- 
cian pyramid (Exercise 3.19). Apply the resulting representations to one of the following two 
tasks: 

• Compression: Compute the entropy in each band for the different wavelet implementations,
assuming a given quantization level (say, 1/4 gray level, to keep the rounding error
acceptable). Quantize the wavelet coefficients and reconstruct the original images. Which
technique performs better? (See Simoncelli and Adelson (1990b) or any of the multitude of
wavelet compression papers for some typical results.)

• Denoising: After computing the wavelets, suppress small values using coring, i.e., set
small values to zero using a piecewise linear or other $C^0$ function. Compare the results
of your denoising using different wavelet and pyramid representations.

Ex 3.22: Parametric image warping Write the code to do affine and perspective image 
warps (optionally bilinear as well). Try a variety of interpolants and report on their visual 
quality. In particular, discuss the following: 

• In a MIP-map, selecting only the coarser level adjacent to the computed fractional 
level will produce a blurrier image, while selecting the finer level will lead to aliasing. 
Explain why this is so and discuss whether blending an aliased and a blurred image 
(tri-linear MIP-mapping) is a good idea.

• When the ratio of the horizontal and vertical resampling rates becomes very different 
(anisotropic), the MIP-map performs even worse. Suggest some approaches to reduce 
such problems. 

Ex 3.23: Local image warping Open an image and deform its appearance in one of the
following ways:

1. Click on a number of pixels and move (drag) them to new locations. Interpolate the
resulting sparse displacement field to obtain a dense motion field (Sections 3.6.2 and
3.5.1).

2. Draw a number of lines in the image. Move the endpoints of the lines to specify their 
new positions and use the Beier-Neely interpolation algorithm (Beier and Neely 1992), 
discussed in Section 3.6.2, to get a dense motion field. 

3. Overlay a spline control grid and move one grid point at a time (optionally select the 
level of the deformation). 

4. Have a dense per-pixel flow field and use a soft “paintbrush” to design a horizontal and 
vertical velocity field. 

5. (Optional): Prove whether the Beier-Neely warp does or does not reduce to a sparse 
point-based deformation as the line segments become shorter (reduce to points). 

Ex 3.24: Forward warping Given a displacement field from the previous exercise, write a 
forward warping algorithm: 

1. Write a forward warper using splatting, either nearest neighbor or soft accumulation
(Section 3.6.1).

2. Write a two-pass algorithm, which forward warps the displacement field, fills in small
holes, and then uses inverse warping (Shade, Gortler, He et al. 1998).

3. Compare the quality of these two algorithms. 


Ex 3.25: Feature-based morphing Extend the warping code you wrote in Exercise 3.23 
to import two different images and specify correspondences (point, line, or mesh-based) be- 
tween the two images. 

1. Create a morph by partially warping the images towards each other and cross-dissolving
(Section 3.6.3).

2. Try using your morphing algorithm to perform an image rotation and discuss whether 
it behaves the way you want it to. 

Ex 3.26: 2D image editor Extend the program you wrote in Exercise 2.2 to import images 
and let you create a “collage” of pictures. You should implement the following steps: 

1. Open up a new image (in a separate window).

2. Shift drag (rubber-band) to crop a subregion (or select whole image).

3. Paste into the current canvas.

4. Select the deformation mode (motion model): translation, rigid, similarity, affine, or
perspective.

5. Drag any corner of the outline to change its transformation.

6. (Optional) Change the relative ordering of the images and which image is currently
being manipulated.

The user should see the composition of the various images’ pieces on top of each other.

This exercise should be built on the image transformation classes supported in the software
library. Persistence of the created representation (save and load) should also be supported
(for each image, save its transformation).

Figure 3.66 There is a faint image of a rainbow visible in the right hand side of this picture.
Can you think of a way to enhance it (Exercise 3.29)?

Ex 3.27: 3D texture-mapped viewer Extend the viewer you created in Exercise 2.3 to include
texture-mapped polygon rendering. Augment each polygon with (u, v, w) coordinates
into an image.

Ex 3.28: Image denoising Implement at least two of the various image denoising techniques
described in this chapter and compare them on both synthetically noised image sequences
and real-world (low-light) sequences. Does the performance of the algorithm depend on the
correct choice of noise level estimate? Can you draw any conclusions as to which techniques
work better?

Ex 3.29: Rainbow enhancer — challenging Take a picture containing a rainbow, such as
Figure 3.66, and enhance the strength (saturation) of the rainbow.

1. Draw an arc in the image delineating the extent of the rainbow.

2. Fit an additive rainbow function (explain why it is additive) to this arc (it is best to work 
with linearized pixel values), using the spectrum as the cross section, and estimating 
the width of the arc and the amount of color being added. This is the trickiest part of 
the problem, as you need to tease apart the (low-frequency) rainbow pattern and the 
natural image hiding behind it. 

3. Amplify the rainbow signal and add it back into the image, re-applying the gamma 
function if necessary to produce the final image. 

Ex 3.30: Image deblocking — challenging Now that you have some good techniques to 
distinguish signal from noise, develop a technique to remove the blocking artifacts that occur 
with JPEG at high compression settings (Section 2.3.3). Your technique can range from
something as simple as looking for unexpected edges along block boundaries to treating the
quantization step as a projection of a convex region of the transform coefficient space onto
the corresponding quantized values.

1. Does the knowledge of the compression factor, which is available in the JPEG header
information, help you perform better deblocking?

2. Because the quantization occurs in the DCT transformed YCbCr space (2.115), it may
be preferable to perform the analysis in this space. On the other hand, image priors
make more sense in an RGB space (or do they?). Decide how you will approach this
dichotomy and discuss your choice.

3. While you are at it, since the YCbCr conversion is followed by a chrominance subsam- 
pling stage (before the DCT), see if you can restore some of the lost high-frequency 
chrominance signal using one of the better restoration techniques discussed in this 
chapter. 

4. If your camera has a RAW + JPEG mode, how close can you come to the noise-free
true pixel values? (This suggestion may not be that useful, since cameras generally use
reasonably high quality settings for their RAW + JPEG modes.)

Ex 3.31: Inference in de-blurring — challenging Write down the graphical model corre- 
sponding to Figure 3.59 for a non-blind image deblurring problem, i.e., one where the blur 
kernel is known ahead of time. 

What kind of efficient inference (optimization) algorithms can you think of for solving
such problems?

Figure 4.1 A variety of feature detectors and descriptors can be used to analyze, describe and
match images: (a) point-like interest operators (Brown, Szeliski, and Winder 2005) © 2005
IEEE; (b) region-like interest operators (Matas, Chum, Urban et al. 2004) © 2004 Elsevier;
(c) edges (Elder and Goldberg 2001) © 2001 IEEE; (d) straight lines (Sinha, Steedly, Szeliski
et al. 2008) © 2008 ACM.







Feature detection and matching are an essential component of many computer vision appli- 
cations. Consider the two pairs of images shown in Figure 4.2. For the first pair, we may 
wish to align the two images so that they can be seamlessly stitched into a composite mosaic 
(Chapter 9). For the second pair, we may wish to establish a dense set of correspondences so 
that a 3D model can be constructed or an in-between view can be generated (Chapter 11). In
either case, what kinds of features should you detect and then match in order to establish such 
an alignment or set of correspondences? Think about this for a few moments before reading 
on. 

The first kind of feature that you may notice are specific locations in the images, such as 
mountain peaks, building corners, doorways, or interestingly shaped patches of snow. These 
kinds of localized features are often called keypoint features or interest points (or even corners)
and are often described by the appearance of patches of pixels surrounding the point location
(Section 4.1). Another class of important features are edges, e.g., the profile of mountains
against the sky (Section 4.2). These kinds of features can be matched based on their orientation
and local appearance (edge profiles) and can also be good indicators of object boundaries
and occlusion events in image sequences. Edges can be grouped into longer curves and
straight line segments, which can be directly matched or analyzed to find vanishing points 
and hence internal and external camera parameters (Section 4.3). 

In this chapter, we describe some practical approaches to detecting such features and 
also discuss how feature correspondences can be established across different images. Point 
features are now used in such a wide variety of applications that it is good practice to read and 
implement some of the algorithms from Section 4.1. Edges and lines provide information
that is complementary to both keypoint and region-based descriptors and are well-suited to 
describing object boundaries and man-made objects. These alternative descriptors, while 
extremely useful, can be skipped in a short introductory course. 


4.1 Points and patches 

Point features can be used to find a sparse set of corresponding locations in different images,
often as a precursor to computing camera pose (Chapter 7), which is a prerequisite for
computing a denser set of correspondences using stereo matching (Chapter 11). Such corre- 
spondences can also be used to align different images, e.g., when stitching image mosaics or 
performing video stabilization (Chapter 9). They are also used extensively to perform object 
instance and category recognition (Sections 14.3 and 14.4). A key advantage of keypoints 
is that they permit matching even in the presence of clutter (occlusion) and large scale and 
orientation changes. 

Feature-based correspondence techniques have been used since the early days of stereo
matching (Hannah 1974; Moravec 1983; Hannah 1988) and have more recently gained popularity
for image-stitching applications (Zoghlami, Faugeras, and Deriche 1997; Brown and
Lowe 2007) as well as fully automated 3D modeling (Beardsley, Torr, and Zisserman 1996;
Schaffalitzky and Zisserman 2002; Brown and Lowe 2003; Snavely, Seitz, and Szeliski 2006).

Figure 4.2 Two pairs of images to be matched. What kinds of feature might one use to
establish a set of correspondences between these images?

There are two main approaches to finding feature points and their correspondences. The 
first is to find features in one image that can be accurately tracked using a local search tech- 
nique, such as correlation or least squares (Section 4.1.4). The second is to independently 
detect features in all the images under consideration and then match features based on their 
local appearance (Section 4.1.3). The former approach is more suitable when images are 
taken from nearby viewpoints or in rapid succession (e.g., video sequences), while the lat- 
ter is more suitable when a large amount of motion or appearance change is expected, e.g., 
in stitching together panoramas (Brown and Lowe 2007), establishing correspondences in 
wide baseline stereo (Schaffalitzky and Zisserman 2002), or performing object recognition 
(Fergus, Perona, and Zisserman 2007). 

In this section, we split the keypoint detection and matching pipeline into four separate
stages. During the feature detection (extraction) stage (Section 4.1.1), each image is searched
for locations that are likely to match well in other images. At the feature description stage
(Section 4.1.2), each region around detected keypoint locations is converted into a more compact
and stable (invariant) descriptor that can be matched against other descriptors. The
feature matching stage (Section 4.1.3) efficiently searches for likely matching candidates in
other images. The feature tracking stage (Section 4.1.4) is an alternative to the third stage
that only searches a small neighborhood around each detected feature and is therefore more
suitable for video processing.

Figure 4.3 Image pairs with extracted patches below. Notice how some patches can be
localized or matched with higher accuracy than others.

A wonderful example of all of these stages can be found in David Lowe’s (2004) paper, 
which describes the development and refinement of his Scale Invariant Feature Transform 
(SIFT). Comprehensive descriptions of alternative techniques can be found in a series of 
survey and evaluation papers covering both feature detection (Schmid, Mohr, and Bauck- 
hage 2000; Mikolajczyk, Tuytelaars, Schmid et al. 2005; Tuytelaars and Mikolajczyk 2007) 
and feature descriptors (Mikolajczyk and Schmid 2005). Shi and Tomasi (1994) and Triggs 
(2004) also provide nice reviews of feature detection techniques. 

4.1.1 Feature detectors 

How can we find image locations where we can reliably find correspondences with other
images, i.e., what are good features to track (Shi and Tomasi 1994; Triggs 2004)? Look again
at the image pair shown in Figure 4.3 and at the three sample patches to see how well they
might be matched or tracked. As you may notice, textureless patches are nearly impossible
to localize. Patches with large contrast changes (gradients) are easier to localize, although
straight line segments at a single orientation suffer from the aperture problem (Horn and
Schunck 1981; Lucas and Kanade 1981; Anandan 1989), i.e., it is only possible to align
the patches along the direction normal to the edge direction (Figure 4.4b). Patches with
gradients in at least two (significantly) different orientations are the easiest to localize, as
shown schematically in Figure 4.4a.

Figure 4.4 Aperture problems for different image patches: (a) stable (“corner-like”) flow;
(b) classic aperture problem (barber-pole illusion); (c) textureless region. The two images $I_0$
(yellow) and $I_1$ (red) are overlaid. The red vector $\mathbf{u}$ indicates the displacement between the
patch centers and the $w(\mathbf{x}_i)$ weighting function (patch window) is shown as a dark circle.

These intuitions can be formalized by looking at the simplest possible matching criterion 
for comparing two image patches, i.e., their (weighted) summed square difference, 

$$E_{\mathrm{WSSD}}(\mathbf{u}) = \sum_i w(\mathbf{x}_i)\,[I_1(\mathbf{x}_i + \mathbf{u}) - I_0(\mathbf{x}_i)]^2, \tag{4.1}$$

where $I_0$ and $I_1$ are the two images being compared, $\mathbf{u} = (u, v)$ is the displacement vector,
$w(\mathbf{x})$ is a spatially varying weighting (or window) function, and the summation over $i$ runs
over all the pixels in the patch. Note that this is the same formulation we later use to estimate motion
between complete images (Section 8.1).
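
To make the weighted summed square difference concrete, here is a minimal NumPy sketch for integer displacements. The function name `wssd`, the square-patch parameterization, and the (column, row) ordering of the displacement are assumptions made for this illustration; no bounds checking is performed.

```python
import numpy as np

def wssd(img0, img1, top_left, size, u, weights=None):
    """E_WSSD(u) for a square patch of img0 whose corner is top_left = (row, col).

    u = (du, dv) is an integer displacement in (x, y) = (column, row) order.
    weights defaults to a uniform window; otherwise pass a size-by-size array.
    The displaced patch must lie inside img1.
    """
    r, c = top_left
    du, dv = u
    patch0 = img0[r:r + size, c:c + size].astype(float)
    patch1 = img1[r + dv:r + dv + size, c + du:c + du + size].astype(float)
    if weights is None:
        weights = np.ones((size, size))
    return float(np.sum(weights * (patch1 - patch0) ** 2))
```

Evaluating this function over a grid of candidate displacements yields an error surface whose minimum marks the best-matching offset.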

When performing feature detection, we do not know which other image locations the
feature will end up being matched against. Therefore, we can only compute how stable this
metric is with respect to small variations in position $\Delta\mathbf{u}$ by comparing an image patch against
itself, which is known as an auto-correlation function or surface

$$E_{\mathrm{AC}}(\Delta\mathbf{u}) = \sum_i w(\mathbf{x}_i)\,[I_0(\mathbf{x}_i + \Delta\mathbf{u}) - I_0(\mathbf{x}_i)]^2 \tag{4.2}$$

(Figure 4.5).¹ Note how the auto-correlation surface for the textured flower bed (Figure 4.5b
and the red cross in the lower right quadrant of Figure 4.5a) exhibits a strong minimum,
indicating that it can be well localized. The correlation surface corresponding to the roof
edge (Figure 4.5c) has a strong ambiguity along one direction, while the correlation surface
corresponding to the cloud region (Figure 4.5d) has no stable minimum.

¹ Strictly speaking, a correlation is the product of two patches (3.12); I’m using the term here in a more qualitative
sense. The weighted sum of squared differences is often called an SSD surface (Section 8.1).

Figure 4.5 Three auto-correlation surfaces $E_{\mathrm{AC}}(\Delta\mathbf{u})$ shown as both grayscale images and
surface plots: (a) The original image is marked with three red crosses to denote where the
auto-correlation surfaces were computed; (b) this patch is from the flower bed (good unique
minimum); (c) this patch is from the roof edge (one-dimensional aperture problem); and (d)
this patch is from the cloud (no good peak). Each grid point in figures b-d is one value of
$\Delta\mathbf{u}$.

Using a Taylor Series expansion of the image function
$I_0(\mathbf{x}_i + \Delta\mathbf{u}) \approx I_0(\mathbf{x}_i) + \nabla I_0(\mathbf{x}_i) \cdot \Delta\mathbf{u}$
(Lucas and Kanade 1981; Shi and Tomasi 1994), we can approximate the auto-correlation
surface as

$$\begin{aligned}
E_{\mathrm{AC}}(\Delta\mathbf{u}) &= \sum_i w(\mathbf{x}_i)\,[I_0(\mathbf{x}_i + \Delta\mathbf{u}) - I_0(\mathbf{x}_i)]^2 && (4.3) \\
&\approx \sum_i w(\mathbf{x}_i)\,[I_0(\mathbf{x}_i) + \nabla I_0(\mathbf{x}_i) \cdot \Delta\mathbf{u} - I_0(\mathbf{x}_i)]^2 && (4.4) \\
&= \sum_i w(\mathbf{x}_i)\,[\nabla I_0(\mathbf{x}_i) \cdot \Delta\mathbf{u}]^2 && (4.5) \\
&= \Delta\mathbf{u}^T \mathbf{A}\, \Delta\mathbf{u}, && (4.6)
\end{aligned}$$

where

$$\nabla I_0(\mathbf{x}_i) = \left(\frac{\partial I_0}{\partial x}, \frac{\partial I_0}{\partial y}\right)(\mathbf{x}_i) \tag{4.7}$$

is the image gradient at $\mathbf{x}_i$. This gradient can be computed using a variety of techniques
(Schmid, Mohr, and Bauckhage 2000). The classic “Harris” detector (Harris and Stephens
1988) uses a $[-2\ {-1}\ 0\ 1\ 2]$ filter, but more modern variants (Schmid, Mohr, and Bauckhage
2000; Triggs 2004) convolve the image with horizontal and vertical derivatives of a Gaussian
(typically with $\sigma = 1$).

The auto-correlation matrix $\mathbf{A}$ can be written as

$$\mathbf{A} = w * \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}, \tag{4.8}$$

where we have replaced the weighted summations with discrete convolutions with the weighting
kernel $w$. This matrix can be interpreted as a tensor (multiband) image, where the outer
products of the gradients $\nabla I$ are convolved with a weighting function $w$ to provide a per-pixel
estimate of the local (quadratic) shape of the auto-correlation function.

As first shown by Anandan (1984; 1989) and further discussed in Section 8.1.3 and (8.44),
the inverse of the matrix $\mathbf{A}$ provides a lower bound on the uncertainty in the location of a
matching patch. It is therefore a useful indicator of which patches can be reliably matched.
The easiest way to visualize and reason about this uncertainty is to perform an eigenvalue
analysis of the auto-correlation matrix $\mathbf{A}$, which produces two eigenvalues $(\lambda_0, \lambda_1)$ and two
eigenvector directions (Figure 4.6). Since the larger uncertainty depends on the smaller eigenvalue,
i.e., $\lambda_0^{-1/2}$, it makes sense to find maxima in the smaller eigenvalue to locate good
features to track (Shi and Tomasi 1994).
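
The quadratic approximation and its eigenvalue analysis can be checked numerically on a single patch. The sketch below is illustrative only: it uses `np.gradient` (central differences) instead of a derivative-of-Gaussian filter, defaults to a uniform window, and the names `autocorrelation_matrix` and `quadratic_energy` are hypothetical.

```python
import numpy as np

def autocorrelation_matrix(patch, weights=None):
    """2x2 matrix A = sum_i w(x_i) grad I(x_i) grad I(x_i)^T over a patch."""
    # np.gradient returns derivatives along axis 0 (rows, y) then axis 1 (cols, x).
    Iy, Ix = np.gradient(np.asarray(patch, dtype=float))
    if weights is None:
        weights = np.ones_like(Ix)
    return np.array([[np.sum(weights * Ix * Ix), np.sum(weights * Ix * Iy)],
                     [np.sum(weights * Ix * Iy), np.sum(weights * Iy * Iy)]])

def quadratic_energy(A, du):
    """Approximate E_AC(du) by the quadratic form du^T A du."""
    du = np.asarray(du, dtype=float)
    return float(du @ A @ du)

# A linear ramp has a single gradient direction, so the smaller eigenvalue
# vanishes: the patch behaves like an edge (the aperture problem).
ys, xs = np.mgrid[0:5, 0:5]
ramp = 2.0 * xs + 3.0 * ys
lam0, lam1 = np.linalg.eigvalsh(autocorrelation_matrix(ramp))  # ascending order
```

For a corner-like patch both eigenvalues are large, while a textureless patch drives both towards zero.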


Förstner-Harris. While Anandan and Lucas and Kanade (1981) were the first to analyze
the uncertainty structure of the auto-correlation matrix, they did so in the context of associating
certainties with optic flow measurements. Förstner (1986) and Harris and Stephens
(1988) were the first to propose using local maxima in rotationally invariant scalar measures
derived from the auto-correlation matrix to locate keypoints for the purpose of sparse feature
matching. (Schmid, Mohr, and Bauckhage (2000) and Triggs (2004) give more detailed historical
reviews of feature detection algorithms.) Both of these techniques also proposed using a
Gaussian weighting window instead of the previously used square patches, which makes the
detector response insensitive to in-plane image rotations.

Figure 4.6 Uncertainty ellipse corresponding to an eigenvalue analysis of the auto-correlation
matrix $\mathbf{A}$.

The minimum eigenvalue $\lambda_0$ (Shi and Tomasi 1994) is not the only quantity that can be
used to find keypoints. A simpler quantity, proposed by Harris and Stephens (1988), is

$$\det(\mathbf{A}) - \alpha\,\mathrm{trace}(\mathbf{A})^2 = \lambda_0 \lambda_1 - \alpha(\lambda_0 + \lambda_1)^2 \tag{4.9}$$

with $\alpha = 0.06$. Unlike eigenvalue analysis, this quantity does not require the use of square
roots and yet is still rotationally invariant and also downweights edge-like features where
$\lambda_1 \gg \lambda_0$. Triggs (2004) suggests using the quantity

$$\lambda_0 - \alpha \lambda_1 \tag{4.10}$$

(say, with $\alpha = 0.05$), which also reduces the response at 1D edges, where aliasing errors
sometimes inflate the smaller eigenvalue. He also shows how the basic $2 \times 2$ Hessian can be
extended to parametric motions to detect points that are also accurately localizable in scale
and rotation. Brown, Szeliski, and Winder (2005), on the other hand, use the harmonic mean,

$$\frac{\det \mathbf{A}}{\mathrm{tr}\,\mathbf{A}} = \frac{\lambda_0 \lambda_1}{\lambda_0 + \lambda_1}, \tag{4.11}$$

which is a smoother function in the region where $\lambda_0 \approx \lambda_1$. Figure 4.7 shows isocontours
of the various interest point operators, from which we can see how the two eigenvalues are
blended to determine the final interest value.

Figure 4.7 Isocontours of popular keypoint detection functions (Brown, Szeliski, and
Winder 2004). Each detector looks for points where the eigenvalues $\lambda_0, \lambda_1$ of
$\mathbf{A} = w * \nabla I \nabla I^T$ are both large.

1. Compute the horizontal and vertical derivatives of the image $I_x$ and $I_y$ by convolving
the original image with derivatives of Gaussians (Section 3.2.3).

2. Compute the three images corresponding to the outer products of these gradients.
(The matrix $\mathbf{A}$ is symmetric, so only three entries are needed.)

3. Convolve each of these images with a larger Gaussian.

4. Compute a scalar interest measure using one of the formulas discussed above.

5. Find local maxima above a certain threshold and report them as detected feature
point locations.

Algorithm 4.1 Outline of a basic feature detection algorithm.

Figure 4.8 Interest operator responses: (a) Sample image, (b) Harris response, and (c) DoG
response. The circle sizes and colors indicate the scale at which each interest point was
detected. Notice how the two detectors tend to respond at complementary locations.


The steps in the basic auto-correlation-based keypoint detector are summarized in Algo- 
rithm 4.1. Figure 4.8 shows the resulting interest operator responses for the classic Harris 
detector as well as the difference of Gaussian (DoG) detector discussed below. 
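
The five steps of Algorithm 4.1 map fairly directly onto array operations. The following NumPy sketch is one possible reading, not a reference implementation: the helper names, the 3-sigma kernel truncation, the replicated borders, the 3 × 3 non-maximum test, and the default threshold are all assumptions made here, and the single `alpha` argument is shared by the Harris and Triggs measures for brevity.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Return sample offsets x and a normalized 1D Gaussian g(x)."""
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x ** 2 / (2 * sigma ** 2))
    return x, g / g.sum()

def conv_axis(image, kernel, axis):
    """1D convolution along one axis with replicated borders."""
    pad = len(kernel) // 2
    widths = [(0, 0), (0, 0)]
    widths[axis] = (pad, pad)
    padded = np.pad(image, widths, mode="edge")
    return np.apply_along_axis(np.convolve, axis, padded, kernel, mode="valid")

def interest_measure(Axx, Axy, Ayy, kind="harris", alpha=0.06):
    """Minimum eigenvalue, or one of the measures (4.9)-(4.11)."""
    det, trace = Axx * Ayy - Axy ** 2, Axx + Ayy
    if kind == "harris":
        return det - alpha * trace ** 2
    if kind == "harmonic":
        return det / (trace + 1e-12)  # guard against flat regions
    # Closed-form eigenvalues of a symmetric 2x2 matrix, lam0 <= lam1.
    root = np.sqrt(((Axx - Ayy) / 2) ** 2 + Axy ** 2)
    lam0, lam1 = trace / 2 - root, trace / 2 + root
    if kind == "shi_tomasi":
        return lam0
    if kind == "triggs":
        return lam0 - alpha * lam1
    raise ValueError(f"unknown measure: {kind}")

def detect_keypoints(image, sigma_d=1.0, sigma_i=2.0, kind="harris",
                     alpha=0.06, threshold=1e-4):
    """Return (response image, array of (row, col) keypoint locations)."""
    image = np.asarray(image, dtype=float)
    # Step 1: derivatives of Gaussian (smooth across, differentiate along).
    x, g = gaussian_kernel(sigma_d)
    dg = -x / sigma_d ** 2 * g
    Ix = conv_axis(conv_axis(image, g, 0), dg, 1)
    Iy = conv_axis(conv_axis(image, g, 1), dg, 0)
    # Steps 2-3: gradient outer products, blurred with a larger Gaussian.
    _, gi = gaussian_kernel(sigma_i)
    Axx, Axy, Ayy = (conv_axis(conv_axis(p, gi, 0), gi, 1)
                     for p in (Ix * Ix, Ix * Iy, Iy * Iy))
    # Step 4: scalar interest measure.
    response = interest_measure(Axx, Axy, Ayy, kind, alpha)
    # Step 5: strict local maxima in a 3x3 neighborhood, above the threshold.
    h, w = response.shape
    padded = np.pad(response, 1, mode="constant", constant_values=-np.inf)
    neighbors = np.stack([padded[1 + dr:1 + dr + h, 1 + dc:1 + dc + w]
                          for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                          if (dr, dc) != (0, 0)])
    is_peak = (response > neighbors.max(axis=0)) & (response > threshold)
    return response, np.argwhere(is_peak)
```

On a synthetic image of a bright square, the Harris response is positive near the four corners, negative along the edges, and zero in flat regions, which matches the qualitative behavior described for the measure.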

Adaptive non-maximal suppression (ANMS). While most feature detectors simply look 
for local maxima in the interest function, this can lead to an uneven distribution of feature 
points across the image, e.g., points will be denser in regions of higher contrast. To mitigate 
this problem, Brown, Szeliski, and Winder (2005) only detect features that are both local
maxima and whose response value is significantly (10%) greater than that of all of its neigh- 
bors within a radius r (Figure 4.9c-d). They devise an efficient way to associate suppression 
radii with all local maxima by first sorting them by their response strength and then creating 
a second list sorted by decreasing suppression radius (Brown, Szeliski, and Winder 2005). 
Figure 4.9 shows a qualitative comparison of selecting the top n features and using ANMS. 
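
The suppression-radius idea can be illustrated with a brute-force version. This sketch is quadratic in the number of points and therefore does not reproduce the efficient sorted-list construction mentioned above; the function name `anms`, the `(x, y, response)` tuple format, and the factor `robust = 0.9` (standing in for the 10% margin) are assumptions of this example.

```python
import math

def anms(points, n_keep, robust=0.9):
    """Keep the n_keep points with the largest suppression radii.

    points is a list of (x, y, response) tuples. The suppression radius of
    a point is its distance to the nearest point whose response, scaled by
    `robust`, still exceeds its own.
    """
    radii = []
    for i, (xi, yi, fi) in enumerate(points):
        radius = math.inf  # points that nothing dominates are never suppressed
        for j, (xj, yj, fj) in enumerate(points):
            if j != i and fi < robust * fj:
                radius = min(radius, math.hypot(xi - xj, yi - yj))
        radii.append(radius)
    order = sorted(range(len(points)), key=lambda i: radii[i], reverse=True)
    return [points[i] for i in order[:n_keep]]
```

With three strong points clustered together and one weaker isolated point, keeping two points by strength alone would pick two from the cluster, whereas this selection keeps the strongest point and the isolated one.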


Figure 4.9 Adaptive non-maximal suppression (ANMS) (Brown, Szeliski, and Winder
2005) © 2005 IEEE: (a) strongest 250; (b) strongest 500; (c) ANMS 250, r = 24; (d) ANMS 500,
r = 16. The upper two images show the strongest 250 and 500 interest points,
while the lower two images show the interest points selected with adaptive non-maximal
suppression, along with the corresponding suppression radius r. Note how the latter features
have a much more uniform spatial distribution across the image.

Measuring repeatability. Given the large number of feature detectors that have been developed
in computer vision, how can we decide which ones to use? Schmid, Mohr, and
Bauckhage (2000) were the first to propose measuring the repeatability of feature detectors,
which they define as the frequency with which keypoints detected in one image are found
within $\epsilon$ (say, $\epsilon = 1.5$) pixels of the corresponding location in a transformed image. In their
paper, they transform their planar images by applying rotations, scale changes, illumination
changes, viewpoint changes, and adding noise. They also measure the information content
available at each detected feature point, which they define as the entropy of a set of rotationally
invariant local grayscale descriptors. Among the techniques they survey, they find that
the improved (Gaussian derivative) version of the Harris operator with $\sigma_d = 1$ (scale of the
derivative Gaussian) and $\sigma_i = 2$ (scale of the integration Gaussian) works best.
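
The repeatability measure described above can be approximated with a simple counting routine. This is a simplified sketch: it assumes the ground-truth mapping between the two images is available as a Python callable (the hypothetical `transform` argument), and it normalizes by the number of points in the first image, whereas a careful evaluation may also need to restrict attention to the region visible in both images.

```python
import math

def repeatability(points_a, points_b, transform, eps=1.5):
    """Fraction of points_a that reappear within eps pixels in points_b.

    points_a, points_b: lists of (x, y) keypoint locations
    transform:          maps an (x, y) location in image A to image B
    """
    if not points_a:
        return 0.0
    repeated = 0
    for p in points_a:
        qx, qy = transform(p)
        if any(math.hypot(qx - bx, qy - by) <= eps for bx, by in points_b):
            repeated += 1
    return repeated / len(points_a)
```

For a planar scene, `transform` would typically apply the known homography or similarity used to synthesize the second image.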


Scale invariance 

In many situations, detecting features at the finest stable scale possible may not be appro- 
priate. For example, when matching images with little high frequency detail (e.g., clouds), 
fine-scale features may not exist. 

One solution to the problem is to extract features at a variety of scales, e.g., by performing 
the same operations at multiple resolutions in a pyramid and then matching features at the 
same level. This kind of approach is suitable when the images being matched do not undergo 
large scale changes, e.g., when matching successive aerial images taken from an airplane or 
stitching panoramas taken with a fixed-focal-length camera. Figure 4.10 shows the output of 
one such approach, the multi-scale, oriented patch detector of Brown, Szeliski, and Winder 
(2005), for which responses at five different scales are shown. 

However, for most object recognition applications, the scale of the object in the image
is unknown. Instead of extracting features at many different scales and then matching all of
them, it is more efficient to extract features that are stable in both location and scale (Lowe
2004; Mikolajczyk and Schmid 2004).

Figure 4.10 Multi-scale oriented patches (MOPS) extracted at five pyramid levels (Brown,
Szeliski, and Winder 2005) © 2005 IEEE. The boxes show the feature orientation and the
region from which the descriptor vectors are sampled.

Early investigations into scale selection were performed by Lindeberg (1993; 1998b), 
who first proposed using extrema in the Laplacian of Gaussian (LoG) function as interest 
point locations. Based on this work, Lowe (2004) proposed computing a set of sub-octave
Difference of Gaussian filters (Figure 4.11a), looking for 3D (space+scale) maxima in the
resulting structure (Figure 4.11b), and then computing a sub-pixel space+scale location using
a quadratic fit (Brown and Lowe 2002). The number of sub-octave levels was determined, after
careful empirical investigation, to be three, which corresponds to a quarter-octave pyramid,
the same as that used by Triggs (2004).

As with the Harris operator, pixels where there is strong asymmetry in the local curvature 
of the indicator function (in this case, the DoG) are rejected. This is implemented by first 
computing the local Hessian of the difference image D, 


$$H = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{bmatrix}, \tag{4.12}$$

and then rejecting keypoints for which

$$\frac{\mathrm{Tr}(H)^2}{\mathrm{Det}(H)} > 10. \tag{4.13}$$

Figure 4.11 Scale-space feature detection using a sub-octave Difference of Gaussian pyramid
(Lowe 2004) © 2004 Springer: (a) Adjacent levels of a sub-octave Gaussian pyramid
are subtracted to produce Difference of Gaussian images; (b) extrema (maxima and minima)
in the resulting 3D volume are detected by comparing a pixel to its 26 neighbors.
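
The rejection test of Equations (4.12) and (4.13) can be sketched in a few lines. The function name `is_edge_like`, the central finite-difference approximation of the Hessian, and the handling of non-positive determinants are assumptions made for this illustration; the text itself only specifies the ratio test and the threshold of 10.

```python
def is_edge_like(D, x, y, max_ratio=10.0):
    """Return True if the keypoint at (x, y) should be rejected.

    D is a 2D list (rows of floats) holding one Difference of Gaussian
    level; (x, y) must not lie on the image border.
    """
    # Second derivatives by central finite differences (an assumed
    # discretization; the text does not prescribe one).
    dxx = D[y][x + 1] - 2.0 * D[y][x] + D[y][x - 1]
    dyy = D[y + 1][x] - 2.0 * D[y][x] + D[y - 1][x]
    dxy = (D[y + 1][x + 1] - D[y + 1][x - 1]
           - D[y - 1][x + 1] + D[y - 1][x - 1]) / 4.0

    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    # A non-positive determinant means the principal curvatures have
    # opposite signs or one of them vanishes, so the point is rejected too.
    if det <= 0.0:
        return True
    return trace * trace / det > max_ratio
```

For an isotropic blob the ratio equals 4, its smallest possible value, so only points whose two principal curvatures differ strongly are discarded.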


While Lowe’s Scale Invariant Feature Transform (SIFT) performs well in practice, it is not 
based on the same theoretical foundation of maximum spatial stability as the auto-correlation- 
based detectors. (In fact, its detection locations are often complementary to those produced 
by such techniques and can therefore be used in conjunction with these other approaches.) 
In order to add a scale selection mechanism to the Harris corner detector, Mikolajczyk and 
Schmid (2004) evaluate the Laplacian of Gaussian function at each detected Harris point (in 
a multi-scale pyramid) and keep only those points for which the Laplacian is extremal (larger 
or smaller than both its coarser and finer-level values). An optional iterative refinement for 
both scale and position is also proposed and evaluated. Additional examples of scale invariant 
region detectors are discussed by Mikolajczyk, Tuytelaars, Schmid et al. (2005); Tuytelaars 
and Mikolajczyk (2007). 
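
The scale selection rule described above (keep a Harris point only where the Laplacian is larger or smaller than both its coarser and finer-level values) might be sketched as follows. The function name and the input format, a list of Laplacian of Gaussian values for a single point across pyramid levels, are hypothetical, and skipping the first and last levels (which have only one neighbor) is an assumption.

```python
def select_characteristic_scales(log_responses):
    """Return indices of levels where the response is a local extremum.

    log_responses[i] is the Laplacian of Gaussian value at one Harris
    point, evaluated at pyramid level i.
    """
    selected = []
    for i in range(1, len(log_responses) - 1):
        prev, cur, nxt = log_responses[i - 1:i + 2]
        is_max = cur > prev and cur > nxt
        is_min = cur < prev and cur < nxt
        if is_max or is_min:
            selected.append(i)
    return selected
```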


Rotational invariance and orientation estimation 

In addition to dealing with scale changes, most image matching and object recognition algo- 
rithms need to deal with (at least) in-plane image rotation. One way to deal with this problem 
is to design descriptors that are rotationally invariant (Schmid and Mohr 1997), but such 
descriptors have poor discriminability, i.e., they map different-looking patches to the same
descriptor.

Figure 4.12 A dominant orientation estimate can be computed by creating a histogram of
all the gradient orientations (weighted by their magnitudes or after thresholding out small 
gradients) and then finding the significant peaks in this distribution (Lowe 2004) © 2004 
Springer. 


A better method is to estimate a dominant orientation at each detected keypoint. Once 
the local orientation and scale of a keypoint have been estimated, a scaled and oriented patch 
around the detected point can be extracted and used to form a feature descriptor (Figures 4.10
and 4.17). 

The simplest possible orientation estimate is the average gradient within a region around 
the keypoint. If a Gaussian weighting function is used (Brown, Szeliski, and Winder 2005), 
this average gradient is equivalent to a first-order steerable filter (Section 3.2.3), i.e., it can be
computed using an image convolution with the horizontal and vertical derivatives of a Gaussian
filter (Freeman and Adelson 1991). In order to make this estimate more reliable, it is
usually preferable to use a larger aggregation window (Gaussian kernel size) than detection 
window (Brown, Szeliski, and Winder 2005). The orientations of the square boxes shown in 
Figure 4.10 were computed using this technique. 
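
As a rough illustration of the average-gradient estimate, the sketch below sums Gaussian-weighted derivatives over a square window and takes the angle of the result. The function name, the list-of-lists input format, and the choice to center the weights on the window are assumptions; a real implementation would obtain the same quantity by convolving the image with derivative of Gaussian filters.

```python
import math

def average_gradient_orientation(gx, gy, sigma):
    """Dominant orientation (radians) from the Gaussian-weighted mean gradient.

    gx and gy are equally sized square 2D lists of horizontal and vertical
    derivatives sampled around the keypoint, which sits at the center.
    """
    n = len(gx)
    c = (n - 1) / 2.0
    sum_x = sum_y = 0.0
    for y in range(n):
        for x in range(n):
            w = math.exp(-((x - c) ** 2 + (y - c) ** 2) / (2.0 * sigma ** 2))
            sum_x += w * gx[y][x]
            sum_y += w * gy[y][x]
    return math.atan2(sum_y, sum_x)
```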

Sometimes, however, the averaged (signed) gradient in a region can be small and therefore 
an unreliable indicator of orientation. A more reliable technique is to look at the histogram 
of orientations computed around the keypoint. Lowe (2004) computes a 36-bin histogram 
of edge orientations weighted by both gradient magnitude and Gaussian distance to the cen- 
ter, finds all peaks within 80% of the global maximum, and then computes a more accurate 
orientation estimate using a three-bin parabolic fit (Figure 4.12).
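
The histogram-based estimate can be sketched as below. The 36 bins, the 80% rule, and the three-bin parabolic fit come from the description above; the function name, the `(angle, weight)` input format, and the assumption that each weight already combines gradient magnitude and the Gaussian fall-off are illustrative choices, not Lowe's actual implementation.

```python
import math

def dominant_orientations(samples, num_bins=36, peak_ratio=0.8):
    """samples: iterable of (angle_radians, weight) pairs.

    Returns the refined peak angles in [0, 2*pi).
    """
    bin_width = 2.0 * math.pi / num_bins
    hist = [0.0] * num_bins
    for angle, weight in samples:
        hist[int((angle % (2.0 * math.pi)) / bin_width) % num_bins] += weight

    top = max(hist)
    if top <= 0.0:
        return []
    peaks = []
    for i, h in enumerate(hist):
        # Circular neighbors (index -1 wraps around in Python).
        left, right = hist[i - 1], hist[(i + 1) % num_bins]
        # Keep local maxima within peak_ratio of the global maximum.
        if h >= peak_ratio * top and h > left and h >= right:
            # Vertex of the parabola through the three neighboring bins.
            denom = left - 2.0 * h + right
            offset = 0.5 * (left - right) / denom if denom != 0.0 else 0.0
            peaks.append(((i + 0.5 + offset) * bin_width) % (2.0 * math.pi))
    return peaks
```

Returning every qualifying peak, rather than only the largest, allows a keypoint with several strong orientations to produce several descriptors.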

Affine invariance 

While scale and rotation invariance are highly desirable, for many applications such as wide 
baseline stereo matching (Pritchett and Zisserman 1998; Schaffalitzky and Zisserman 2002) 
or location recognition (Chum, Philbin, Sivic et al. 2007), full affine invariance is preferred.

Figure 4.13 Affine region detectors used to match two images taken from dramatically
different viewpoints (Mikolajczyk and Schmid 2004) © 2004 Springer. 



Figure 4.14 Affine normalization using the second moment matrices, as described by
Mikolajczyk, Tuytelaars, Schmid et al. (2005) © 2005 Springer. After image coordinates are
transformed using the matrices $A_0^{-1/2}$ and $A_1^{-1/2}$, they are related by a pure rotation R, which
can be estimated using a dominant orientation technique.


Affine-invariant detectors not only respond at consistent locations after scale and orientation 
changes, they also respond consistently across affine deformations such as (local) perspective 
foreshortening (Figure 4.13). In fact, for a small enough patch, any continuous image warping 
can be well approximated by an affine deformation. 

To introduce affine invariance, several authors have proposed fitting an ellipse to the auto- 
correlation or Hessian matrix (using eigenvalue analysis) and then using the principal axes 
and ratios of this fit as the affine coordinate frame (Lindeberg and Garding 1997; Baumberg 
2000; Mikolajczyk and Schmid 2004; Mikolajczyk, Tuytelaars, Schmid et al. 2005; Tuyte- 
laars and Mikolajczyk 2007). Figure 4.14 shows how the square root of the moment matrix 
can be used to transform local patches into a frame which is similar up to rotation. 
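
For a 2 x 2 symmetric positive definite moment matrix, the square root has a closed form, which the sketch below uses. The helper names are hypothetical, and whether the square root or its inverse is applied to the patch coordinates depends on the direction of the mapping, so treat that choice as an assumption to check against the cited papers.

```python
import math

def sqrt_2x2(a, b, c):
    """Principal square root of the symmetric positive definite [[a, b], [b, c]]."""
    s = math.sqrt(a * c - b * b)        # sqrt(det A)
    t = math.sqrt(a + c + 2.0 * s)      # sqrt(trace A + 2 sqrt(det A))
    # By the Cayley-Hamilton identity, (A + s I)^2 = (trace A + 2 s) A.
    return [[(a + s) / t, b / t], [b / t, (c + s) / t]]

def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def matmul_2x2(m, n):
    return [[sum(m[i][k] * n[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]
```

Multiplying a moment matrix on both sides by its inverse square root yields the identity, which is why two normalized patches can differ only by a rotation.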

Another important affine invariant region detector is the maximally stable extremal region 
(MSER) detector developed by Matas, Chum, Urban et al. (2004). To detect MSERs, binary 
regions are computed by thresholding the image at all possible gray levels (the technique 
therefore only works for grayscale images). This operation can be performed efficiently by 
first sorting all pixels by gray value and then incrementally adding pixels to each connected 
component as the threshold is changed (Nister and Stewenius 2008). As the threshold is 
changed, the area of each component (region) is monitored; regions whose rate of change of 
area with respect to the threshold is minimal are defined as maximally stable. Such regions

Figure 4.15 Maximally stable extremal regions (MSERs) extracted and matched from a
number of images (Matas, Chum, Urban et al. 2004) © 2004 Elsevier.



Figure 4.16 Feature matching: how can we extract local descriptors that are invariant 
to inter-image variations and yet still discriminative enough to establish correct correspon- 
dences? 

are therefore invariant to both affine geometric and photometric (linear bias-gain or smooth 
monotonic) transformations (Figure 4.15). If desired, an affine coordinate frame can be fit to 
each detected region using its moment matrix. 

The area of feature point detectors continues to be very active, with papers appearing ev- 
ery year at major computer vision conferences (Xiao and Shah 2003; Koethe 2003; Carneiro 
and Jepson 2005; Kenney, Zuliani, and Manjunath 2005; Bay, Tuytelaars, and Van Gool 2006; 
Platel, Balmachnova, Florack et al. 2006; Rosten and Drummond 2006). Mikolajczyk, Tuyte-
laars, Schmid et al. (2005) survey a number of popular affine region detectors and provide
experimental comparisons of their invariance to common image transformations such as scal- 
ing, rotations, noise, and blur. These experimental results, code, and pointers to the surveyed 
papers can be found on their Web site at http://www.robots.ox.ac.uk/~vgg/research/affine/. 

Of course, keypoints are not the only features that can be used for registering images. 
Zoghlami, Faugeras, and Deriche (1997) use line segments as well as point-like features to 
estimate homographies between pairs of images, whereas Bartoli, Coquerelle, and Sturm 
(2004) use line segments with local correspondences along the edges to extract 3D structure 
and motion. Tuytelaars and Van Gool (2004) use affine invariant regions to detect corre- 
spondences for wide baseline stereo matching, whereas Kadir, Zisserman, and Brady (2004) 
detect salient regions where patch entropy and its rate of change with scale are locally max- 
imal. Corso and Hager (2005) use a related technique to fit 2D oriented Gaussian kernels 
to homogeneous regions. More details on techniques for finding and matching curves, lines, 
and regions can be found later in this chapter.

Figure 4.17 MOPS descriptors are formed using an 8 x 8 sampling of bias and gain nor-
malized intensity values, with a sample spacing of five pixels relative to the detection scale
(Brown, Szeliski, and Winder 2005) © 2005 IEEE. This low frequency sampling gives the 
features some robustness to interest point location error and is achieved by sampling at a 
higher pyramid level than the detection scale. 


4.1.2 Feature descriptors 

After detecting features (keypoints), we must match them, i.e., we must determine which 
features come from corresponding locations in different images. In some situations, e.g., for 
video sequences (Shi and Tomasi 1994) or for stereo pairs that have been rectified (Zhang, 
Deriche, Faugeras et al. 1995; Loop and Zhang 1999; Scharstein and Szeliski 2002), the lo- 
cal motion around each feature point may be mostly translational. In this case, simple error 
metrics, such as the sum of squared differences or normalized cross-correlation, described 
in Section 8.1 can be used to directly compare the intensities in small patches around each 
feature point. (The comparative study by Mikolajczyk and Schmid (2005), discussed below, 
uses cross-correlation.) Because feature points may not be exactly located, a more accurate 
matching score can be computed by performing incremental motion refinement as described 
in Section 8.1.3 but this can be time consuming and can sometimes even decrease perfor- 
mance (Brown, Szeliski, and Winder 2005). 

In most cases, however, the local appearance of features will change in orientation and 
scale, and sometimes even undergo affine deformations. Extracting a local scale, orientation, 
or affine frame estimate and then using this to resample the patch before forming the feature 
descriptor is thus usually preferable (Figure 4.17). 

Even after compensating for these changes, the local appearance of image patches will 
usually still vary from image to image. How can we make image descriptors more invariant to 
such changes, while still preserving discriminability between different (non-corresponding) 
patches (Figure 4.16)? Mikolajczyk and Schmid (2005) review some recently developed 
view-invariant local image descriptors and experimentally compare their performance. Be-
low, we describe a few of these descriptors in more detail.

Bias and gain normalization (MOPS). For tasks that do not exhibit large amounts of fore-
shortening, such as image stitching, simple normalized intensity patches perform reasonably
well and are simple to implement (Brown, Szeliski, and Winder 2005) (Figure 4.17). In or- 
der to compensate for slight inaccuracies in the feature point detector (location, orientation, 
and scale), these multi-scale oriented patches (MOPS) are sampled at a spacing of five pixels 
relative to the detection scale, using a coarser level of the image pyramid to avoid aliasing. 
To compensate for affine photometric variations (linear exposure changes or bias and gain, 
(3.3)), patch intensities are re-scaled so that their mean is zero and their variance is one. 
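
The bias and gain normalization step is simple enough to sketch directly. The function name and the treatment of constant patches (returning zeros) are assumptions made for this example.

```python
import math

def normalize_bias_gain(patch):
    """Return a copy of a 2D patch rescaled to zero mean and unit variance."""
    values = [v for row in patch for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    if std == 0.0:
        # Constant patch: no gain can be estimated, so return all zeros.
        return [[0.0 for _ in row] for row in patch]
    return [[(v - mean) / std for v in row] for row in patch]
```

Two patches that differ only by a linear intensity change with positive gain, I' = gain * I + bias, map to the same normalized patch.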

Scale invariant feature transform (SIFT). SIFT features are formed by computing the 
gradient at each pixel in a 16 x 16 window around the detected keypoint, using the appropriate 
level of the Gaussian pyramid at which the keypoint was detected. The gradient magnitudes 
are downweighted by a Gaussian fall-off function (shown as a blue circle in Figure 4.18a) in
order to reduce the influence of gradients far from the center, as these are more affected by 
small misregistrations. 

In each 4x4 quadrant, a gradient orientation histogram is formed by (conceptually) 
adding the weighted gradient value to one of eight orientation histogram bins. To reduce the 
effects of location and dominant orientation misestimation, each of the original 256 weighted 
gradient magnitudes is softly added to 2 x 2 x 2 histogram bins using trilinear interpolation. 
Softly distributing values to adjacent histogram bins is generally a good idea in any appli- 
cation where histograms are being computed, e.g., for Hough transforms (Section 4.3.2) or 
local histogram equalization (Section 3.1.4). 
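
The idea of softly distributing a value over adjacent bins can be shown in one dimension; SIFT applies the same linear weighting along the two spatial axes and the orientation axis at once. The function name, the convention that bin i is centered at i + 0.5, and the circular wrap-around (appropriate for orientation histograms) are assumptions of this sketch.

```python
import math

def soft_add(hist, position, weight):
    """Add weight at a fractional bin position, split over two circular bins.

    position is measured in bin units, with bin i centered at i + 0.5.
    """
    n = len(hist)
    shifted = position - 0.5
    lower = math.floor(shifted)
    frac = shifted - lower
    # The nearer bin center receives the larger share of the weight.
    hist[lower % n] += weight * (1.0 - frac)
    hist[(lower + 1) % n] += weight * frac
```

A sample that falls exactly on the boundary between two bins contributes half of its weight to each, so a small shift in the sample no longer causes an abrupt change in the histogram.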

The resulting 128 non-negative values form a raw version of the SIFT descriptor vector. 
To reduce the effects of contrast or gain (additive variations are already removed by the gra- 
dient), the 128-D vector is normalized to unit length. To further make the descriptor robust to 
other photometric variations, values are clipped to 0.2 and the resulting vector is once again 
renormalized to unit length. 
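
The normalize, clip, and renormalize sequence can be sketched as follows. The function name is hypothetical; the clipping value of 0.2 is the one given in the text.

```python
import math

def normalize_sift(raw, clip=0.2):
    """Normalize to unit length, clip large entries, and renormalize."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v] if norm > 0.0 else list(v)

    clipped = [min(x, clip) for x in unit(raw)]
    return unit(clipped)
```

Because the first step divides by the vector length, multiplying all gradient magnitudes by a constant gain leaves the final descriptor unchanged.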

PCA-SIFT. Ke and Sukthankar (2004) propose a simpler way to compute descriptors in- 
spired by SIFT; it computes the x and y (gradient) derivatives over a 39 x 39 patch and 
then reduces the resulting 3042-dimensional vector to 36 using principal component analysis 
(PCA) (Section 14.2.1 and Appendix A.1.2). Another popular variant of SIFT is SURF (Bay,
Tuytelaars, and Van Gool 2006), which uses box filters to approximate the derivatives and 
integrals used in SIFT. 

Gradient location-orientation histogram (GLOH). This descriptor, developed by Miko- 
lajczyk and Schmid (2005), is a variant on SIFT that uses a log-polar binning structure instead 
of the four quadrants used by Lowe (2004) (Figure 4.19). The spatial bins are of radius 6,

(a) image gradients (b) keypoint descriptor


Figure 4.18 A schematic representation of Lowe’s (2004) scale invariant feature transform 
(SIFT): (a) Gradient orientations and magnitudes are computed at each pixel and weighted 
by a Gaussian fall-off function (blue circle). (b) A weighted gradient orientation histogram
is then computed in each subregion, using trilinear interpolation. While this figure shows an
8 x 8 pixel patch and a 2 x 2 descriptor array, Lowe’s actual implementation uses 16 x 16
patches and a 4 x 4 array of eight-bin histograms. 


11, and 15, with eight angular bins (except for the central region), for a total of 17 spa- 
tial bins and 16 orientation bins. The 272-dimensional histogram is then projected onto 
a 128-dimensional descriptor using PCA trained on a large database. In their evaluation, 
Mikolajczyk and Schmid (2005) found that GLOH, which has the best performance overall, 
outperforms SIFT by a small margin. 

Steerable filters. Steerable filters (Section 3.2.3) are combinations of derivative of Gaus- 
sian filters that permit the rapid computation of even and odd (symmetric and anti-symmetric) 
edge-like and corner-like features at all possible orientations (Freeman and Adelson 1991). 
Because they use reasonably broad Gaussians, they too are somewhat insensitive to localiza- 
tion and orientation errors. 

Performance of local descriptors. Among the local descriptors that Mikolajczyk and Schmid 
(2005) compared, they found that GLOH performed best, followed closely by SIFT (see Fig- 
ure 4.25). They also present results for many other descriptors not covered in this book. 

The field of feature descriptors continues to evolve rapidly, with some of the newer tech- 
niques looking at local color information (van de Weijer and Schmid 2006; Abdel-Hakim 
and Farag 2006). Winder and Brown (2007) develop a multi-stage framework for feature 
descriptor computation that subsumes both SIFT and GLOH (Figure 4.20a) and also allows 
them to learn optimal parameters for newer descriptors that outperform previous hand-tuned

(a) image gradients (b) keypoint descriptor


Figure 4.19 The gradient location-orientation histogram (GLOH) descriptor uses log-polar 
bins instead of square bins to compute orientation histograms (Mikolajczyk and Schmid 
2005). 


descriptors. Hua, Brown, and Winder (2007) extend this work by learning lower-dimensional 
projections of higher-dimensional descriptors that have the best discriminative power. Both 
of these papers use a database of real-world image patches (Figure 4.20b) obtained by sam- 
pling images at locations that were reliably matched using a robust structure-from-motion 
algorithm applied to Internet photo collections (Snavely, Seitz, and Szeliski 2006; Goesele, 
Snavely, Curless et al. 2007). In concurrent work, Tola, Lepetit, and Fua (2010) developed a
similar DAISY descriptor for dense stereo matching and optimized its parameters based on 
ground truth stereo data. 

While these techniques construct feature detectors that optimize for repeatability across 
all object classes, it is also possible to develop class- or instance-specific feature detectors that 
maximize discriminability from other classes (Ferencz, Learned-Miller, and Malik 2008). 


4.1.3 Feature matching 

Once we have extracted features and their descriptors from two or more images, the next step 
is to establish some preliminary feature matches between these images. In this section, we 
divide this problem into two separate components. The first is to select a matching strategy,
which determines which correspondences are passed on to the next stage for further process- 
ing. The second is to devise efficient data structures and algorithms to perform this matching 
as quickly as possible. (See the discussion of related techniques in Section 14.3.2.)

Figure 4.20 Spatial summation blocks for SIFT, GLOH, and some newly developed feature 
descriptors (Winder and Brown 2007) © 2007 IEEE: (a) The parameters for the new features, 
e.g., their Gaussian weights, are learned from a training database of (b) matched real-world 
image patches obtained from robust structure from motion applied to Internet photo collec- 
tions (Hua, Brown, and Winder 2007). 

Matching strategy and error rates 

Determining which feature matches are reasonable to process further depends on the context 
in which the matching is being performed. Say we are given two images that overlap by a fair
amount (e.g., for image stitching, as in Figure 4.16, or for tracking objects in a video). We
know that most features in one image are likely to match the other image, although some may 
not match because they are occluded or their appearance has changed too much. 

On the other hand, if we are trying to recognize how many known objects appear in a clut- 
tered scene (Figure 4.21), most of the features may not match. Furthermore, a large number 
of potentially matching objects must be searched, which requires more efficient strategies, as 
described below. 

To begin with, we assume that the feature descriptors have been designed so that Eu- 
clidean (vector magnitude) distances in feature space can be directly used for ranking poten- 
tial matches. If it turns out that certain parameters (axes) in a descriptor are more reliable 
than others, it is usually preferable to re-scale these axes ahead of time, e.g., by determin- 
ing how much they vary when compared against other known good matches (Hua, Brown, 
and Winder 2007). A more general process, which involves transforming feature vectors 
into a new scaled basis, is called whitening and is discussed in more detail in the context of 
eigenface-based face recognition (Section 14.2.1). 

Given a Euclidean distance metric, the simplest matching strategy is to set a threshold 
(maximum distance) and to return all matches from other images within this threshold. Set-
ting the threshold too high results in too many false positives, i.e., incorrect matches being
returned. Setting the threshold too low results in too many false negatives, i.e., too many 
correct matches being missed (Figure 4.22). 

We can quantify the performance of a matching algorithm at a particular threshold by

Figure 4.21 Recognizing objects in a cluttered scene (Lowe 2004) © 2004 Springer. Two of
the training images in the database are shown on the left. These are matched to the cluttered 
scene in the middle using SIFT features, shown as small squares in the right image. The affine 
warp of each recognized database image onto the scene is shown as a larger parallelogram in 
the right image. 



Figure 4.22 False positives and negatives: The black digits 1 and 2 are features being 
matched against a database of features in other images. At the current threshold setting (the 
solid circles), the green 1 is a true positive (good match), the blue 1 is a false negative (failure 
to match), and the red 3 is a false positive (incorrect match). If we set the threshold higher 
(the dashed circles), the blue 1 becomes a true positive but the brown 4 becomes an additional 
false positive.

                         True matches    True non-matches
Predicted matches        TP = 18         FP = 4              P' = 22        PPV = 0.82
Predicted non-matches    FN = 2          TN = 76             N' = 78
                         P = 20          N = 80              Total = 100
                         TPR = 0.90      FPR = 0.05                         ACC = 0.94


Table 4.1 The number of matches correctly and incorrectly estimated by a feature matching 
algorithm, showing the number of true positives (TP), false positives (FP), false negatives 
(FN) and true negatives (TN). The columns sum up to the actual number of positives (P) and 
negatives (N), while the rows sum up to the predicted number of positives (P’) and negatives 
(N’). The formulas for the true positive rate (TPR), the false positive rate (FPR), the positive 
predictive value (PPV), and the accuracy (ACC) are given in the text. 


first counting the number of true and false matches and match failures, using the following 
definitions (Fawcett 2006): 

• TP: true positives, i.e., number of correct matches; 

• FN: false negatives, matches that were not correctly detected; 

• FP: false positives, proposed matches that are incorrect; 

• TN: true negatives, non-matches that were correctly rejected. 

Table 4.1 shows a sample confusion matrix (contingency table) containing such numbers. 

We can convert these numbers into unit rates by defining the following quantities (Fawcett 
2006): 


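• true positive rate (TPR),

$$\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} = \frac{\mathrm{TP}}{\mathrm{P}}; \tag{4.14}$$

• false positive rate (FPR),

$$\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}} = \frac{\mathrm{FP}}{\mathrm{N}}; \tag{4.15}$$

• positive predictive value (PPV),

$$\mathrm{PPV} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} = \frac{\mathrm{TP}}{\mathrm{P'}}; \tag{4.16}$$

• accuracy (ACC),

$$\mathrm{ACC} = \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{P}+\mathrm{N}}. \tag{4.17}$$

Figure 4.23 ROC curve and its related rates: (a) The ROC curve plots the true positive rate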
against the false positive rate for a particular combination of feature extraction and match- 
ing algorithms. Ideally, the true positive rate should be close to 1, while the false positive 
rate is close to 0. The area under the ROC curve (AUC) is often used as a single (scalar) 
measure of algorithm performance. Alternatively, the equal error rate is sometimes used. (b)
The distribution of positives (matches) and negatives (non-matches) as a function of inter-
feature distance d. As the threshold θ is increased, the number of true positives (TP) and false
positives (FP) increases.


In the information retrieval (or document retrieval) literature (Baeza-Yates and Ribeiro-
Neto 1999; Manning, Raghavan, and Schütze 2008), the term precision (how many returned
documents are relevant) is used instead of PPV and recall (what fraction of relevant docu- 
ments was found) is used instead of TPR. 
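
As a quick check, the four rates can be computed from the counts in Table 4.1. The function name and the dictionary return format are illustrative choices only.

```python
def matching_rates(tp, fp, fn, tn):
    """Compute TPR, FPR, PPV, and ACC from confusion-matrix counts."""
    p, n = tp + fn, fp + tn          # actual positives (P) and negatives (N)
    p_pred = tp + fp                 # predicted positives (P')
    return {
        "TPR": tp / p,               # recall
        "FPR": fp / n,
        "PPV": tp / p_pred,          # precision
        "ACC": (tp + tn) / (p + n),
    }

# Counts taken from Table 4.1.
rates = matching_rates(tp=18, fp=4, fn=2, tn=76)
```

With these counts, the TPR is 18/20, the FPR is 4/80, the PPV is 18/22, and the ACC is 94/100, in agreement with the values listed in the table.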

Any particular matching strategy (at a particular threshold or parameter setting) can be 
rated by the TPR and FPR numbers; ideally, the true positive rate will be close to 1 and the 
false positive rate close to 0. As we vary the matching threshold, we obtain a family of such 
points, which are collectively known as the receiver operating characteristic (ROC curve)
(Fawcett 2006) (Figure 4.23a). The closer this curve lies to the upper left corner, i.e., the 
larger the area under the curve (AUC), the better its performance. Figure 4.23b shows how 
we can plot the number of matches and non-matches as a function of inter-feature distance d. 
These curves can then be used to plot an ROC curve (Exercise 4.3). The ROC curve can also 
be used to calculate the mean average precision, which is the average precision (PPV) as you
vary the threshold to select the best results, then the two top results, etc. 
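
One possible way to trace an ROC curve from labeled candidate matches is to sort them by descriptor distance and sweep the threshold upward. The function names and input format are hypothetical, the sketch assumes at least one true match and one true non-match, and the trapezoidal rule for the AUC is a common choice rather than one prescribed by the text.

```python
def roc_curve(distances, is_match):
    """Return a list of (FPR, TPR) points, one per distinct threshold.

    distances[i] is the descriptor distance of candidate pair i and
    is_match[i] says whether that pair is a true correspondence.
    """
    p = sum(1 for m in is_match if m)
    n = len(is_match) - p
    points = [(0.0, 0.0)]
    tp = fp = 0
    pairs = sorted(zip(distances, is_match))
    for i, (d, m) in enumerate(pairs):
        if m:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all pairs tied at this distance.
        if i + 1 == len(pairs) or pairs[i + 1][0] != d:
            points.append((fp / n, tp / p))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as sorted (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A matcher whose true matches all have smaller distances than its non-matches yields an AUC of 1.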

The problem with using a fixed threshold is that it is difficult to set; the useful range

Figure 4.24 Fixed threshold, nearest neighbor, and nearest neighbor distance ratio matching.
At a fixed distance threshold (dashed circles), descriptor $D_A$ fails to match $D_B$ and $D_D$
incorrectly matches $D_C$ and $D_E$. If we pick the nearest neighbor, $D_A$ correctly matches $D_B$
but $D_D$ incorrectly matches $D_C$. Using nearest neighbor distance ratio (NNDR) matching,
the small NNDR $d_1/d_2$ correctly matches $D_A$ with $D_B$, and the large NNDR $d_1'/d_2'$ correctly
rejects matches for $D_D$.


of thresholds can vary a lot as we move to different parts of the feature space (Lowe 2004; 
Mikolajczyk and Schmid 2005). A better strategy in such cases is to simply match the nearest 
neighbor in feature space. Since some features may have no matches (e.g., they may be part 
of background clutter in object recognition or they may be occluded in the other image), a 
threshold is still used to reduce the number of false positives. 

Ideally, this threshold itself will adapt to different regions of the feature space. If sufficient 
training data is available (Hua, Brown, and Winder 2007), it is sometimes possible to learn 
different thresholds for different features. Often, however, we are simply given a collection 
of images to match, e.g., when stitching images or constructing 3D models from unordered 
photo collections (Brown and Lowe 2007, 2003; Snavely, Seitz, and Szeliski 2006). In this 
case, a useful heuristic can be to compare the nearest neighbor distance to that of the second 
nearest neighbor, preferably taken from an image that is known not to match the target (e.g., 
a different object in the database) (Brown and Lowe 2002; Lowe 2004). We can define this 
nearest neighbor distance ratio (Mikolajczyk and Schmid 2005) as 


$$\mathrm{NNDR} = \frac{d_1}{d_2} = \frac{\|D_A - D_B\|}{\|D_A - D_C\|}, \tag{4.18}$$

where $d_1$ and $d_2$ are the nearest and second nearest neighbor distances, $D_A$ is the target
descriptor, and $D_B$ and $D_C$ are its closest two neighbors (Figure 4.24).
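
A brute-force version of NNDR matching might look like the sketch below. The function name and the acceptance threshold of 0.8 are illustrative assumptions, not values taken from the text. For simplicity, both neighbors are drawn from the same candidate set, whereas the text prefers a second neighbor taken from an image known not to match.

```python
import math

def nndr_matches(queries, candidates, max_ratio=0.8):
    """Return (query_index, candidate_index, ratio) for accepted matches."""
    matches = []
    for qi, q in enumerate(queries):
        # Distances to all candidates, smallest first (brute force).
        ranked = sorted((math.dist(q, c), ci)
                        for ci, c in enumerate(candidates))
        if len(ranked) < 2:
            continue
        (d1, best), (d2, _) = ranked[0], ranked[1]
        ratio = d1 / d2 if d2 > 0.0 else 1.0
        # A small ratio means the best match is much closer than the runner-up.
        if ratio < max_ratio:
            matches.append((qi, best, ratio))
    return matches
```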

The effects of using these three different matching strategies for the feature descriptors 
evaluated by Mikolajczyk and Schmid (2005) are shown in Figure 4.25. As you can see, the 
nearest neighbor and NNDR strategies produce improved ROC curves.



Figure 4.25 Performance of the feature descriptors evaluated by Mikolajczyk and Schmid 
(2005) © 2005 IEEE, shown for three matching strategies: (a) fixed threshold; (b) nearest 
neighbor; (c) nearest neighbor distance ratio (NNDR). Note how the ordering of the algo- 
rithms does not change that much, but the overall performance varies significantly between 
the different matching strategies.

Figure 4.26 The three Haar wavelet coefficients used for hashing the MOPS descriptor de-
vised by Brown, Szeliski, and Winder (2005) are computed by summing each 8 x 8 normalized
patch over the light and dark gray regions and taking their difference.


Efficient matching 

Once we have decided on a matching strategy, we still need to search efficiently for poten- 
tial candidates. The simplest way to find all corresponding feature points is to compare all 
features against all other features in each pair of potentially matching images. Unfortunately, 
this is quadratic in the number of extracted features, which makes it impractical for most 
applications. 

A better approach is to devise an indexing structure, such as a multi-dimensional search
tree or a hash table, to rapidly search for features near a given feature. Such indexing struc- 
tures can either be built for each image independently (which is useful if we want to only 
consider certain potential matches, e.g., searching for a particular object) or globally for all 
the images in a given database, which can potentially be faster, since it removes the need to it- 
erate over each image. For extremely large databases (millions of images or more), even more 
efficient structures based on ideas from document retrieval (e.g., vocabulary trees (Nister and
Stewenius 2006)) can be used (Section 14.3.2).

One of the simpler techniques to implement is multi-dimensional hashing, which maps 
descriptors into fixed size buckets based on some function applied to each descriptor vector. 
At matching time, each new feature is hashed into a bucket, and a search of nearby buckets 
is used to return potential candidates, which can then be sorted or graded to determine which 
are valid matches. 

A simple example of hashing is the Haar wavelets used by Brown, Szeliski, and Winder 
(2005) in their MOPS paper. During the matching structure construction, each 8 x 8 scaled,
oriented, and normalized MOPS patch is converted into a three-element index by perform-
ing sums over different quadrants of the patch (Figure 4.26). The resulting three values are
normalized by their expected standard deviations and then mapped to the two (of b = 10)
nearest 1D bins. The three-dimensional indices formed by concatenating the three quantized
values are used to index the 2^3 = 8 bins where the feature is stored (added). At query time,
only the primary (closest) indices are used, so only a single three-dimensional bin needs to



Figure 4.27 K-d tree and best bin first (BBF) search (Beis and Lowe 1999) © 1999 IEEE: 
(a) The spatial arrangement of the axis-aligned cutting planes is shown using dashed lines. 
Individual data points are shown as small diamonds. (b) The same subdivision can be repre-
sented as a tree, where each interior node represents an axis-aligned cutting plane (e.g., the
top node cuts along dimension d1 at value .34) and each leaf node is a data point. During a
BBF search, a query point (denoted by “+”) first looks in its containing bin (D) and then in 
its nearest adjacent bin (B), rather than its closest neighbor in the tree (C). 


be examined. The coefficients in the bin can then be used to select k approximate nearest 
neighbors for further processing (such as computing the NNDR). 
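
The indexing scheme just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the MOPS paper: the class name `HaarHashTable`, the helper names, the exact quadrant combinations, and the choice of bins evenly covering ±3 standard deviations (`extent`) are all hypothetical.

```python
import itertools
from collections import defaultdict

import numpy as np


def haar_coefficients(patch):
    """Three Haar-like responses computed from the four quadrant sums of a patch."""
    h = patch.shape[0] // 2
    tl, tr = patch[:h, :h].sum(), patch[:h, h:].sum()
    bl, br = patch[h:, :h].sum(), patch[h:, h:].sum()
    return np.array([
        (tr + br) - (tl + bl),  # right half minus left half
        (bl + br) - (tl + tr),  # bottom half minus top half
        (tl + br) - (tr + bl),  # diagonal difference
    ], dtype=float)


def nearest_two_bins(coeffs, sigmas, b=10, extent=3.0):
    """Return the closest and second-closest 1D bin for each coefficient.

    Assumption: the b bins evenly cover [-extent, +extent] standard deviations.
    """
    u = (coeffs / sigmas + extent) / (2.0 * extent) * b  # continuous bin coordinate
    u = np.clip(u, 0.0, b - 1e-9)
    primary = np.floor(u).astype(int)
    # The second-nearest bin lies on the side of the bin centre the value falls on.
    side = np.where(u - primary >= 0.5, 1, -1)
    secondary = np.clip(primary + side, 0, b - 1)
    return primary.tolist(), secondary.tolist()


class HaarHashTable:
    def __init__(self, sigmas, b=10):
        self.sigmas = np.asarray(sigmas, dtype=float)
        self.b = b
        self.bins = defaultdict(list)

    def add(self, feature_id, patch):
        primary, secondary = nearest_two_bins(
            haar_coefficients(patch), self.sigmas, self.b)
        # Store the feature in every (primary, secondary) combination: up to 2^3 bins.
        for key in set(itertools.product(*zip(primary, secondary))):
            self.bins[key].append(feature_id)

    def query(self, patch):
        # Only the primary (closest) index is examined at query time.
        primary, _ = nearest_two_bins(
            haar_coefficients(patch), self.sigmas, self.b)
        return self.bins.get(tuple(primary), [])
```

Storing each feature in several neighboring bins costs memory at construction time but lets the query touch a single bin, which is the tradeoff the text describes.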

A more complex, but more widely applicable, version of hashing is called locality sen- 
sitive hashing, which uses unions of independently computed hashing functions to index 
the features (Gionis, Indyk, and Motwani 1999; Shakhnarovich, Darrell, and Indyk 2006). 
Shakhnarovich, Viola, and Darrell (2003) extend this technique to be more sensitive to the 
distribution of points in parameter space, which they call parameter-sensitive hashing. Even 
more recent work converts high-dimensional descriptor vectors into binary codes that can be 
compared using Hamming distances (Torralba, Weiss, and Fergus 2008; Weiss, Torralba, and 
Fergus 2008) or that can accommodate arbitrary kernel functions (Kulis and Grauman 2009; 
Raginsky and Lazebnik 2009). 

Another widely used class of indexing structures are multi-dimensional search trees. The
best known of these are k-d trees, also often written as kd-trees, which divide the
multi-dimensional feature space along alternating axis-aligned hyperplanes, choosing the threshold
along each axis so as to maximize some criterion, such as the search tree balance (Samet 
1989). Figure 4.27 shows an example of a two-dimensional k-d tree. Here, eight different data 
points A-H are shown as small diamonds arranged on a two-dimensional plane. The k-d tree 
recursively splits this plane along axis-aligned (horizontal or vertical) cutting planes. Each
split can be denoted using the dimension number and split value (Figure 4.27b). The splits are
arranged so as to try to balance the tree, i.e., to keep its maximum depth as small as possible.
At query time, a classic k-d tree search first locates the query point (+) in its appropriate
bin (D), and then searches nearby leaves in the tree (C, B, . . .) until it can guarantee that
the nearest neighbor has been found. The best bin first (BBF) search (Beis and Lowe 1999)
searches bins in order of their spatial proximity to the query point and is therefore usually
more efficient.
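
As a rough illustration of the two ideas above (median splits on alternating axes, then visiting subtrees in order of their distance to the query), here is a minimal sketch. It is not the Beis and Lowe implementation: for brevity it stores one data point at every node rather than only at the leaves as in Figure 4.27, and the names `KDNode`, `build_kdtree`, `bbf_nearest`, and `max_checks` are assumptions made for this example.

```python
import heapq
import itertools

import numpy as np


class KDNode:
    def __init__(self, index, dim, left, right):
        self.index, self.dim, self.left, self.right = index, dim, left, right


def build_kdtree(points, indices=None, depth=0):
    """Split on alternating axes at the median so that the tree stays balanced."""
    if indices is None:
        indices = np.arange(len(points))
    if len(indices) == 0:
        return None
    dim = depth % points.shape[1]
    order = indices[np.argsort(points[indices, dim])]
    mid = len(order) // 2
    return KDNode(int(order[mid]), dim,
                  build_kdtree(points, order[:mid], depth + 1),
                  build_kdtree(points, order[mid + 1:], depth + 1))


def bbf_nearest(root, points, query, max_checks=None):
    """Best-bin-first search; exact when max_checks is None."""
    best_d2, best_index = np.inf, -1
    tie = itertools.count()          # tie-breaker so the heap never compares nodes
    heap = [(0.0, next(tie), root)]  # (lower bound on squared distance, tie, subtree)
    checks = 0
    while heap:
        bound, _, node = heapq.heappop(heap)
        if bound >= best_d2:
            break  # every remaining subtree is at least this far away
        while node is not None:
            d2 = float(np.sum((points[node.index] - query) ** 2))
            checks += 1
            if d2 < best_d2:
                best_d2, best_index = d2, node.index
            diff = query[node.dim] - points[node.index, node.dim]
            near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
            if far is not None:
                # The far side is at least |diff| away along the splitting axis.
                heapq.heappush(heap, (max(bound, diff * diff), next(tie), far))
            node = near
        if max_checks is not None and checks >= max_checks:
            break  # approximate answer: stop after a fixed budget of examined points
    return best_index, best_d2
```

Setting `max_checks` to a small number turns the exact search into an approximate one, which is how a BBF-style search is typically used with high-dimensional descriptors.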

Many additional data structures have been developed over the years for solving nearest 
neighbor problems (Arya, Mount, Netanyahu et al. 1998; Liang, Liu, Xu et al. 2001; Hjaltason
and Samet 2003). For example, Nene and Nayar (1997) developed a technique they call
slicing that uses a series of 1D binary searches on the point list sorted along different
dimensions to efficiently cull down a list of candidate points that lie within a hypercube of the query
point. Grauman and Darrell (2005) reweight the matches at different levels of an indexing 
tree, which allows their technique to be less sensitive to discretization errors in the tree con- 
struction. Nister and Stewenius (2006) use a metric tree, which compares feature descriptors 
to a small number of prototypes at each level in a hierarchy. The resulting quantized visual 
words can then be used with classical information retrieval (document relevance) techniques 
to quickly winnow down a set of potential candidates from a database of millions of images 
(Section 14.3.2). Muja and Lowe (2009) compare a number of these approaches, introduce a
new one of their own (priority search on hierarchical k-means trees), and conclude that mul- 
tiple randomized k-d trees often provide the best performance. Despite all of this promising 
work, the rapid computation of image feature correspondences remains a challenging open 
research problem. 

Feature match verification and densification 

Once we have some hypothetical (putative) matches, we can often use geometric alignment 
(Section 6.1) to verify which matches are inliers and which ones are outliers. For example, 
if we expect the whole image to be translated or rotated in the matching view, we can fit a 
global geometric transform and keep only those feature matches that are sufficiently close to 
this estimated transformation. The process of selecting a small set of seed matches and then 
verifying a larger set is often called random sampling or RANSAC (Section 6.1.4). Once an 
initial set of correspondences has been established, some systems look for additional matches, 
e.g., by looking for additional correspondences along epipolar lines (Section 11.1) or in the 
vicinity of estimated locations based on the global transform. These topics are discussed 
further in Sections 6.1, 11.2, and 14.3.1. 
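
To make the verification step concrete, the following sketch fits a pure 2D translation to putative matches with a RANSAC-style loop and keeps the matches that agree with it. It is a simplified illustration, assuming a translation-only motion model; the function name `ransac_translation` and the parameters `threshold`, `iterations`, and `seed` are hypothetical choices for this example.

```python
import numpy as np


def ransac_translation(src, dst, threshold=3.0, iterations=100, seed=0):
    """Estimate a 2D translation from putative matches and flag the inliers.

    src, dst: (N, 2) arrays of matched feature locations in the two images.
    """
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(iterations):
        i = rng.integers(len(src))   # one match suffices to hypothesize a translation
        t = dst[i] - src[i]
        residuals = np.linalg.norm(src + t - dst, axis=1)
        inliers = residuals < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Re-estimate the translation from all inliers of the best hypothesis.
    t = (dst[best_inliers] - src[best_inliers]).mean(axis=0)
    return t, best_inliers
```

For richer motion models (similarity, affine, homography), the same loop applies but each hypothesis needs a larger minimal sample of matches.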

4.1.4 Feature tracking

An alternative to independently finding features in all candidate images and then matching 
them is to find a set of likely feature locations in a first image and to then search for their 
corresponding locations in subsequent images. This kind of detect then track approach is 
more widely used for video tracking applications, where the amount of motion and
appearance deformation between adjacent frames is expected to be small.

The process of selecting good features to track is closely related to selecting good features 
for more general recognition applications. In practice, regions containing high gradients in 
both directions, i.e., which have high eigenvalues in the auto-correlation matrix (4.8), provide 
stable locations at which to find correspondences (Shi and Tomasi 1994). 
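
The selection criterion can be illustrated by computing the smaller eigenvalue of the auto-correlation matrix at every pixel. This is a minimal NumPy sketch under simplifying assumptions: a box window of half-width `radius` replaces any weighting function, gradients come from central differences, and the function names are made up for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def box_sum(a, radius):
    """Sum of `a` over a (2*radius+1)^2 window centred at each pixel."""
    padded = np.pad(a, radius, mode="edge")
    windows = sliding_window_view(padded, (2 * radius + 1, 2 * radius + 1))
    return windows.sum(axis=(-1, -2))


def min_eigenvalue_map(img, radius=2):
    """Smaller eigenvalue of the 2x2 auto-correlation matrix at every pixel."""
    iy, ix = np.gradient(img.astype(float))  # derivatives along rows and columns
    a = box_sum(ix * ix, radius)
    b = box_sum(ix * iy, radius)
    c = box_sum(iy * iy, radius)
    # Closed-form smaller eigenvalue of [[a, b], [b, c]].
    return 0.5 * ((a + c) - np.sqrt((a - c) ** 2 + 4.0 * b * b))
```

Pixels where this map is large have strong gradients in both directions; straight edges and flat regions score near zero.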

In subsequent frames, searching for locations where the corresponding patch has low 
squared difference (4.1) often works well enough. However, if the images are undergo- 
ing brightness change, explicitly compensating for such variations (8.9) or using normalized 
cross-correlation (8.11) may be preferable. If the search range is large, it is also often more 
efficient to use a hierarchical search strategy, which uses matches in lower-resolution images 
to provide better initial guesses and hence speed up the search (Section 8.1.1). Alternatives 
to this strategy involve learning what the appearance of the patch being tracked should be and 
then searching for it in the vicinity of its predicted position (Avidan 2001; Jurie and Dhome 
2002; Williams, Blake, and Cipolla 2003). These topics are all covered in more detail in 
Section 8.1.3. 
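
The simplest version of this search can be sketched as an exhaustive scan over integer offsets around the predicted location, scored either by the sum of squared differences or by normalized cross-correlation. The function names and the `radius` and `use_ncc` parameters are assumptions for this illustration; a practical tracker would add the hierarchical and sub-pixel refinements discussed in Chapter 8.

```python
import numpy as np


def ssd(a, b):
    """Sum of squared differences between two equally sized patches."""
    return float(np.sum((a - b) ** 2))


def ncc(a, b):
    """Normalized cross-correlation, insensitive to gain and bias changes."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt(np.sum(a * a) * np.sum(b * b))
    return float(np.sum(a * b) / denom) if denom > 0 else 0.0


def search_patch(template, image, predicted, radius, use_ncc=False):
    """Test every integer offset within `radius` of the predicted top-left corner."""
    h, w = template.shape
    best_score, best_pos = -np.inf, None
    for y in range(predicted[0] - radius, predicted[0] + radius + 1):
        for x in range(predicted[1] - radius, predicted[1] + radius + 1):
            if y < 0 or x < 0 or y + h > image.shape[0] or x + w > image.shape[1]:
                continue  # candidate window falls outside the image
            window = image[y:y + h, x:x + w]
            score = ncc(template, window) if use_ncc else -ssd(template, window)
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_pos
```

With `use_ncc=True` the match survives a global brightness change of the form gain × image + bias, which would defeat the plain squared-difference score.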

If features are being tracked over longer image sequences, their appearance can undergo 
larger changes. You then have to decide whether to continue matching against the originally 
detected patch (feature) or to re-sample each subsequent frame at the matching location. The 
former strategy is prone to failure as the original patch can undergo appearance changes such 
as foreshortening. The latter runs the risk of the feature drifting from its original location 
to some other location in the image (Shi and Tomasi 1994). (Mathematically, small mis- 
registration errors compound to create a Markov Random Walk, which leads to larger drift 
over time.) 

A preferable solution is to compare the original patch to later image locations using an 
affine motion model (Section 8.2). Shi and Tomasi (1994) first compare patches in neigh- 
boring frames using a translational model and then use the location estimates produced by 
this step to initialize an affine registration between the patch in the current frame and the 
base frame where a feature was first detected (Figure 4.28). In their system, features are only 
detected infrequently, i.e., only in regions where tracking has failed. In the usual case, an 
area around the current predicted location of the feature is searched with an incremental reg- 
istration algorithm (Section 8.1.3). The resulting tracker is often called the Kanade-Lucas- 
Tomasi (KLT) tracker. 

Figure 4.28 Feature tracking using an affine motion model (Shi and Tomasi 1994) © 1994
IEEE. Top row: image patch around the tracked feature location. Bottom row: image patch
after warping back toward the first frame using an affine deformation. Even though the speed
sign gets larger from frame to frame, the affine transformation maintains a good resemblance
between the original and subsequent tracked frames.


Since their original work on feature tracking, Shi and Tomasi’s approach has generated a 
string of interesting follow-on papers and applications. Beardsley, Torr, and Zisserman (1996) 
use extended feature tracking combined with structure from motion (Chapter 7) to incremen- 
tally build up sparse 3D models from video sequences. Kang, Szeliski, and Shum (1997) 
tie together the corners of adjacent (regularly gridded) patches to provide some additional 
stability to the tracking, at the cost of poorer handling of occlusions. Tommasini, Fusiello, 
Trucco et al. (1998) provide a better spurious match rejection criterion for the basic Shi and
Tomasi algorithm, Collins and Liu (2003) provide improved mechanisms for feature selec- 
tion and dealing with larger appearance changes over time, and Shafique and Shah (2005) 
develop algorithms for feature matching (data association) for videos with large numbers of 
moving objects or points. Yilmaz, Javed, and Shah (2006) and Lepetit and Fua (2005) survey 
the larger field of object tracking, which includes not only feature-based techniques but also 
alternative techniques based on contour and region (Section 5.1). 

One of the newest developments in feature tracking is the use of learning algorithms to 
build special-purpose recognizers to rapidly search for matching features anywhere in an 
image (Lepetit, Pilet, and Fua 2006; Hinterstoisser, Benhimane, Navab et al. 2008; Rogez,
Rihan, Ramalingam et al. 2008; Ozuysal, Calonder, Lepetit et al. 2010). 2 By taking the time
to train classifiers on sample patches and their affine deformations, extremely fast and reliable 
feature detectors can be constructed, which enables much faster motions to be supported 
(Figure 4.29). Coupling such features to deformable models (Pilet, Lepetit, and Fua 2008) or 
structure-from-motion algorithms (Klein and Murray 2008) can result in even higher stability. 

2 See also my previous comment on earlier work in learning-based tracking (Avidan 2001; Jurie and Dhome
2002; Williams, Blake, and Cipolla 2003). 

Figure 4.29 Real-time head tracking using the fast trained classifiers of Lepetit, Pilet, and
Fua (2004) © 2004 IEEE.


4.1.5 Application: Performance-driven animation

One of the most compelling applications of fast feature tracking is performance-driven an- 
imation, i.e., the interactive deformation of a 3D graphics model based on tracking a user’s 
motions (Williams 1990; Litwinowicz and Williams 1994; Lepetit, Pilet, and Fua 2004). 

Buck, Finkelstein, Jacobs et al. (2000) present a system that tracks a user’s facial expres- 
sions and head motions and then uses them to morph among a series of hand-drawn sketches. 
An animator first extracts the eye and mouth regions of each sketch and draws control lines 
over each image (Figure 4.30a). At run time, a face-tracking system (Toyama 1998) deter- 
mines the current location of these features (Figure 4.30b). The animation system decides 
which input images to morph based on nearest neighbor feature appearance matching and 
triangular barycentric interpolation. It also computes the global location and orientation of 
the head from the tracked features. The resulting morphed eye and mouth regions are then 
composited back into the overall head model to yield a frame of hand-drawn animation (Fig- 
ure 4.30d). 

In more recent work, Barnes, Jacobs, Sanders et al. (2008) watch users animate paper 
cutouts on a desk and then turn the resulting motions and drawings into seamless 2D anima- 
tions. 

Figure 4.30 Performance-driven, hand-drawn animation (Buck, Finkelstein, Jacobs et al.
2000) © 2000 ACM: (a) eye and mouth portions of hand-drawn sketch with their overlaid 
control lines; (b) an input video frame with the tracked features overlaid; (c) a different input 
video frame along with its (d) corresponding hand-drawn animation. 


4.2 Edges 

While interest points are useful for finding image locations that can be accurately matched 
in 2D, edge points are far more plentiful and often carry important semantic associations. 
For example, the boundaries of objects, which also correspond to occlusion events in 3D, are 
usually delineated by visible contours. Other kinds of edges correspond to shadow boundaries 
or crease edges, where surface orientation changes rapidly. Isolated edge points can also be 
grouped into longer curves or contours, as well as straight line segments (Section 4.3). It 
is interesting that even young children have no difficulty in recognizing familiar objects or 
animals from such simple line drawings. 

4.2.1 Edge detection 

Given an image, how can we find the salient edges? Consider the color images in Figure 4.31. 
If someone asked you to point out the most “salient” or “strongest” edges or the object bound- 
aries (Martin, Fowlkes, and Malik 2004; Arbelaez, Maire, Fowlkes et al. 2010), which ones 
would you trace? How closely do your perceptions match the edge images shown in Fig- 
ure 4.31? 

Qualitatively, edges occur at boundaries between regions of different color, intensity, or 
texture. Unfortunately, segmenting an image into coherent regions is a difficult task, which 
we address in Chapter 5. Often, it is preferable to detect edges using only purely local infor- 
mation. 

Figure 4.31 Human boundary detection (Martin, Fowlkes, and Malik 2004) © 2004 IEEE.
The darkness of the edges corresponds to how many human subjects marked an object boundary
at that location.

Under such conditions, a reasonable approach is to define an edge as a location of rapid
intensity variation. 3 Think of an image as a height field. On such a surface, edges occur
at locations of steep slopes, or equivalently, in regions of closely packed contour lines (on a
topographic map).

A mathematical way to define the slope and direction of a surface is through its gradient, 

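$$ \mathbf{J}(\mathbf{x}) = \nabla I(\mathbf{x}) = \left( \frac{\partial I}{\partial x}, \frac{\partial I}{\partial y} \right)(\mathbf{x}). \tag{4.19} $$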

The local gradient vector J points in the direction of steepest ascent in the intensity function. 
Its magnitude is an indication of the slope or strength of the variation, while its orientation 
points in a direction perpendicular to the local contour. 

Unfortunately, taking image derivatives accentuates high frequencies and hence amplifies 
noise, since the proportion of noise to signal is larger at high frequencies. It is therefore 
prudent to smooth the image with a low-pass filter prior to computing the gradient. Because 
we would like the response of our edge detector to be independent of orientation, a circularly 
symmetric smoothing filter is desirable. As we saw in Section 3.2, the Gaussian is the only 
separable circularly symmetric filter and so it is used in most edge detection algorithms. 
Canny (1986) discusses alternative filters and a number of researchers review alternative edge
detection algorithms and compare their performance (Davis 1975; Nalwa and Binford 1986; 
Nalwa 1987; Deriche 1987; Freeman and Adelson 1991; Nalwa 1993; Heath, Sarkar, Sanocki 
et al. 1998; Crane 1997; Ritter and Wilson 2000; Bowyer, Kranenburg, and Dougherty 2001; 
Arbelaez, Maire, Fowlkes et al. 2010). 

3 We defer the topic of edge detection in color images.

Because differentiation is a linear operation, it commutes with other linear filtering operations.
The gradient of the smoothed image can therefore be written as

$$ \mathbf{J}_\sigma(\mathbf{x}) = \nabla [G_\sigma(\mathbf{x}) * I(\mathbf{x})] = [\nabla G_\sigma](\mathbf{x}) * I(\mathbf{x}), \tag{4.20} $$


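i.e., we can convolve the image with the horizontal and vertical derivatives of the Gaussian
kernel function,

$$ \nabla G_\sigma(\mathbf{x}) = \left( \frac{\partial G_\sigma}{\partial x}, \frac{\partial G_\sigma}{\partial y} \right)(\mathbf{x}) = [-x \;\; -y] \, \frac{1}{\sigma^3} \exp\left( -\frac{x^2 + y^2}{2\sigma^2} \right). \tag{4.21} $$

(The parameter σ indicates the width of the Gaussian.) This is the same computation that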
is performed by Freeman and Adelson’s (1991) first-order steerable filter, which we already 
covered in Section 3.2.3. 
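
Because the Gaussian is separable, the smoothed gradient can be computed with 1D convolutions only. The sketch below is one possible NumPy rendering, not a reference implementation: the kernel radius of 3σ, the discrete renormalization of the derivative kernel (so that a unit ramp has a response of exactly 1), and the function names are assumptions made here.

```python
import numpy as np


def gaussian_kernels(sigma, radius=None):
    """Sampled 1D Gaussian and its derivative."""
    radius = int(np.ceil(3 * sigma)) if radius is None else radius
    t = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-t ** 2 / (2 * sigma ** 2))
    g /= g.sum()                 # unit DC gain
    dg = -t * g                  # proportional to the derivative of the Gaussian
    dg /= np.sum(t * t * g)      # discrete stand-in for 1/sigma^2: unit ramp -> 1
    return g, dg


def filter_axis(img, kernel, axis):
    """Convolve every 1D slice of `img` along `axis` with `kernel`."""
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), axis, img)


def gaussian_gradient(img, sigma):
    """Smoothed gradient components, magnitude, and orientation."""
    g, dg = gaussian_kernels(sigma)
    jx = filter_axis(filter_axis(img, dg, axis=1), g, axis=0)  # d/dx, smooth in y
    jy = filter_axis(filter_axis(img, g, axis=1), dg, axis=0)  # smooth in x, d/dy
    return jx, jy, np.hypot(jx, jy), np.arctan2(jy, jx)
```

Values within one kernel radius of the image border are affected by the implicit zero padding of `np.convolve` and should be ignored or handled with explicit padding.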

For many applications, however, we wish to thin such a continuous gradient image to 
only return isolated edges, i.e., as single pixels at discrete locations along the edge contours. 
This can be achieved by looking for maxima in the edge strength (gradient magnitude) in a 
direction perpendicular to the edge orientation, i.e., along the gradient direction. 

Finding this maximum corresponds to taking a directional derivative of the strength field 
in the direction of the gradient and then looking for zero crossings. The desired directional 
derivative is equivalent to the dot product between a second gradient operator and the results 
of the first, 

$$ S_\sigma(\mathbf{x}) = \nabla \cdot \mathbf{J}_\sigma(\mathbf{x}) = [\nabla^2 G_\sigma](\mathbf{x}) * I(\mathbf{x}). \tag{4.22} $$


The gradient operator dot product with the gradient is called the Laplacian. The convolution 
kernel 


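$$ \nabla^2 G_\sigma(\mathbf{x}) = \frac{1}{\sigma^3} \left( \frac{x^2 + y^2}{\sigma^2} - 2 \right) \exp\left( -\frac{x^2 + y^2}{2\sigma^2} \right) \tag{4.23} $$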


is therefore called the Laplacian of Gaussian (LoG) kernel (Marr and Hildreth 1980). This 
kernel can be split into two separable parts, 


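$$ \nabla^2 G_\sigma(\mathbf{x}) = \frac{1}{\sigma^2} \left( \frac{x^2}{\sigma^2} - 1 \right) G_\sigma(x) G_\sigma(y) + \frac{1}{\sigma^2} \left( \frac{y^2}{\sigma^2} - 1 \right) G_\sigma(y) G_\sigma(x) \tag{4.24} $$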


(Wiejak, Buxton, and Buxton 1985), which allows for a much more efficient implementation 
using separable filtering (Section 3.2.1). 

In practice, it is quite common to replace the Laplacian of Gaussian convolution with a 
Difference of Gaussian (DoG) computation, since the kernel shapes are qualitatively similar 
(Figure 3.35). This is especially convenient if a “Laplacian pyramid” (Section 3.5) has already 
been computed. 4 

4 Recall that Burt and Adelson's (1983a) “Laplacian pyramid” actually computed differences of Gaussian-filtered 
levels. 

In fact, it is not strictly necessary to take differences between adjacent levels when computing
the edge field. Think about what a zero crossing in a “generalized” difference of
Gaussians image represents. The finer (smaller kernel) Gaussian is a noise-reduced version 
of the original image. The coarser (larger kernel) Gaussian is an estimate of the average in- 
tensity over a larger region. Thus, whenever the DoG image changes sign, this corresponds 
to the (slightly blurred) image going from relatively darker to relatively lighter, as compared 
to the average intensity in that neighborhood. 

Once we have computed the sign function S(x), we must find its zero crossings and
convert these into edge elements (edgels). An easy way to detect and represent zero crossings
is to look for adjacent pixel locations $\mathbf{x}_i$ and $\mathbf{x}_j$ where the sign changes value, i.e.,
$[S(\mathbf{x}_i) > 0] \neq [S(\mathbf{x}_j) > 0]$.

The sub-pixel location of this crossing can be obtained by computing the “x-intercept” of
the “line” connecting $S(\mathbf{x}_i)$ and $S(\mathbf{x}_j)$,

$$ \mathbf{x}_z = \frac{\mathbf{x}_i S(\mathbf{x}_j) - \mathbf{x}_j S(\mathbf{x}_i)}{S(\mathbf{x}_j) - S(\mathbf{x}_i)}. \tag{4.25} $$


The orientation and strength of such edgels can be obtained by linearly interpolating the 
gradient values computed on the original pixel grid. 
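
The zero-crossing test and the sub-pixel interpolation of (4.25) translate almost directly into code. The sketch below assumes S is given as a 2D array (for example, a Laplacian of Gaussian or difference of Gaussians response) and checks only horizontal and vertical neighbors; the function name `zero_crossings` is hypothetical.

```python
import numpy as np


def zero_crossings(S):
    """Sub-pixel (x, y) zero crossings between 4-connected neighboring pixels."""
    positive = S > 0
    height, width = S.shape
    crossings = []
    for y in range(height):
        for x in range(width):
            for dy, dx in ((0, 1), (1, 0)):       # right and down neighbors
                y2, x2 = y + dy, x + dx
                if y2 >= height or x2 >= width:
                    continue
                if positive[y, x] == positive[y2, x2]:
                    continue                      # no sign change on this pixel edge
                si, sj = S[y, x], S[y2, x2]
                # Equation (4.25), applied to each coordinate separately.
                xz = (x * sj - x2 * si) / (sj - si)
                yz = (y * sj - y2 * si) / (sj - si)
                crossings.append((float(xz), float(yz)))
    return crossings
```

The denominator cannot vanish because a sign change implies that the two samples differ.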

An alternative edgel representation can be obtained by linking adjacent edgels on the 
dual grid to form edgels that live inside each square formed by four adjacent pixels in the 
original pixel grid. 5 The (potential) advantage of this representation is that the edgels now
live on a grid offset by half a pixel from the original pixel grid and are thus easier to store 
and access. As before, the orientations and strengths of the edges can be computed by 
interpolating the gradient field or estimating these values from the difference of Gaussian 
image (see Exercise 4.7). 

In applications where the accuracy of the edge orientation is more important, higher-order 
steerable filters can be used (Freeman and Adelson 1991) (see Section 3.2.3). Such filters are 
more selective for more elongated edges and also have the possibility of better modeling curve 
intersections because they can represent multiple orientations at the same pixel (Figure 3.16). 
Their disadvantage is that they are more expensive to compute and the directional derivative 
of the edge strength does not have a simple closed form solution. 6

Figure 4.32 Scale selection for edge detection (Elder and Zucker 1998) © 1998 IEEE:
(a) original image; (b-c) Canny/Deriche edge detector tuned to the finer (mannequin) and 
coarser (shadow) scales; (d) minimum reliable scale for gradient estimation; (e) minimum 
reliable scale for second derivative estimation; (f) final detected edges. 


Scale selection and blur estimation 

As we mentioned before, the derivative, Laplacian, and Difference of Gaussian filters (4.20-4.23)
all require the selection of a spatial scale parameter σ. If we are only interested in
detecting sharp edges, the width of the filter can be determined from image noise characteris- 
tics (Canny 1986; Elder and Zucker 1998). However, if we want to detect edges that occur at 
different resolutions (Figures 4.32b-c), a scale-space approach that detects and then selects 
edges at different scales may be necessary (Witkin 1983; Lindeberg 1994, 1998a; Nielsen, 
Florack, and Deriche 1997). 

Elder and Zucker (1998) present a principled approach to solving this problem. Given
a known image noise level, their technique computes, for every pixel, the minimum scale
at which an edge can be reliably detected (Figure 4.32d). Their approach first computes
gradients densely over an image by selecting among gradient estimates computed at different
scales, based on their gradient magnitudes. It then performs a similar estimate of minimum
scale for directed second derivatives and uses zero crossings of this latter quantity to robustly
select edges (Figures 4.32e-f). As an optional final step, the blur width of each edge can
be computed from the distance between extrema in the second derivative response minus the
width of the Gaussian filter.

5 This algorithm is a 2D version of the 3D marching cubes isosurface extraction algorithm (Lorensen and Cline
1987).

6 In fact, the edge orientation can have a 180° ambiguity for “bar edges”, which makes the computation of zero
crossings in the derivative more tricky.

Color edge detection 

While most edge detection techniques have been developed for grayscale images, color im- 
ages can provide additional information. For example, noticeable edges between iso-luminant 
colors (colors that have the same luminance) are useful cues but fail to be detected by grayscale 
edge operators. 

One simple approach is to combine the outputs of grayscale detectors run on each color 
band separately. 7 However, some care must be taken. For example, if we simply sum up 
the gradients in each of the color bands, the signed gradients may actually cancel each other! 
(Consider, for example a pure red-to-green edge.) We could also detect edges independently 
in each band and then take the union of these, but this might lead to thickened or doubled 
edges that are hard to link. 
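
The cancellation problem is easy to reproduce on a single scanline. The toy example below is only an illustration: it uses plain finite differences and the root of the summed squared per-band gradients as an unsigned combination, which is a simplification and not the oriented-energy approach described next.

```python
import numpy as np

# One scanline crossing a pure red-to-green edge; columns are (R, G, B).
row = np.zeros((8, 3))
row[:4, 0] = 1.0   # left half is pure red
row[4:, 1] = 1.0   # right half is pure green

grad = np.diff(row, axis=0)                 # signed derivative in each color band
summed = grad.sum(axis=1)                   # naive sum: R falls exactly as G rises
energy = np.sqrt((grad ** 2).sum(axis=1))   # unsigned combination of the bands

edge_index = int(np.argmax(energy))         # position of the strongest response
```

Here `summed` is zero everywhere, so the edge is invisible to the signed sum, while `energy` peaks at the transition between the red and green halves.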

A better approach is to compute the oriented energy in each band (Morrone and Burr 
1988; Perona and Malik 1990a), e.g., using a second-order steerable filter (Section 3.2.3) 
(Freeman and Adelson 1991), and then sum up the orientation-weighted energies and find 
their joint best orientation. Unfortunately, the directional derivative of this energy may not 
have a closed form solution (as in the case of signed first-order steerable filters), so a simple 
zero crossing-based strategy cannot be used. However, the technique described by Elder and 
Zucker (1998) can be used to compute these zero crossings numerically instead. 

An alternative approach is to estimate local color statistics in regions around each pixel 
(Ruzon and Tomasi 2001; Martin, Fowlkes, and Malik 2004). This has the advantage that 
more sophisticated techniques (e.g., 3D color histograms) can be used to compare regional 
statistics and that additional measures, such as texture, can also be considered. Figure 4.33 
shows the output of such detectors. 

Of course, many other approaches have been developed for detecting color edges, dating
back to early work by Nevatia (1977). Ruzon and Tomasi (2001) and Gevers, van de Weijer,
and Stokman (2006) provide good reviews of these approaches, which include ideas such as
fusing outputs from multiple channels, using multidimensional gradients, and vector-based
methods.

7 Instead of using the raw RGB space, a more perceptually uniform color space such as L*a*b* (see Section 2.3.2)
can be used. When trying to match human performance (Martin, Fowlkes, and Malik 2004), this makes sense.
However, in terms of the physics of the underlying image formation and sensing, it may be a questionable strategy.

Combining edge feature cues 

If the goal of edge detection is to match human boundary detection performance (Bowyer, 
Kranenburg, and Dougherty 2001; Martin, Fowlkes, and Malik 2004; Arbelaez, Maire, Fowlkes 
et al. 2010), as opposed to simply finding stable features for matching, even better detectors 
can be constructed by combining multiple low-level cues such as brightness, color, and tex- 
ture. 

Martin, Fowlkes, and Malik (2004) describe a system that combines brightness, color, and 
texture edges to produce state-of-the-art performance on a database of hand-segmented natu- 
ral color images (Martin, Fowlkes, Tal et al. 2001). First, they construct and train separate 
oriented half-disc detectors for measuring significant differences in brightness (luminance), 
color (a* and b* channels, summed responses), and texture (un-normalized filter bank re- 
sponses from the work of Malik, Belongie, Leung et al. (2001)). Some of the responses 
are then sharpened using a soft non-maximal suppression technique. Finally, the outputs of 
the three detectors are combined using a variety of machine-learning techniques, from which
logistic regression is found to have the best tradeoff between speed, space, and accuracy.
The resulting system (see Figure 4.33 for some examples) is shown to outperform previously 
developed techniques. Maire, Arbelaez, Fowlkes et al. (2008) improve on these results by 
combining the detector based on local appearance with a spectral (segmentation-based) de- 
tector (Belongie and Malik 1998). In more recent work, Arbelaez, Maire, Fowlkes et al. 
(2010) build a hierarchical segmentation on top of this edge detector using a variant of the 
watershed algorithm. 


4.2.2 Edge linking 

While isolated edges can be useful for a variety of applications, such as line detection (Sec- 
tion 4.3) and sparse stereo matching (Section 11.2), they become even more useful when 
linked into continuous contours. 

If the edges have been detected using zero crossings of some function, linking them up 
is straightforward, since adjacent edgels share common endpoints. Linking the edgels into 
chains involves picking up an unlinked edgel and following its neighbors in both directions. 
Either a sorted list of edgels (sorted first by x coordinates and then by y coordinates, for 
example) or a 2D array can be used to accelerate the neighbor finding. If edges were not 
detected using zero crossings, finding the continuation of an edgel can be tricky. In this 
case, comparing the orientation (and, optionally, phase) of adjacent edgels can be used for
disambiguation. Ideas from connected component computation can also sometimes be used
to make the edge linking process even faster (see Exercise 4.8).

The training uses 200 labeled images and testing is performed on a different set of 100 images.

Figure 4.33 Combined brightness, color, texture boundary detector (Martin, Fowlkes, and
Malik 2004) © 2004 IEEE. Successive rows show the outputs of the brightness gradient
(BG), color gradient (CG), texture gradient (TG), and combined (BG+CG+TG) detectors.
The final row shows human-labeled boundaries derived from a database of hand-segmented
images (Martin, Fowlkes, Tal et al. 2001).

Figure 4.34 Chain code representation of a grid-aligned linked edge chain. The code is
represented as a series of direction codes, e.g., 0 1 0 7 6 5, which can further be compressed
using predictive and run-length coding.

Once the edgels have been linked into chains, we can apply an optional thresholding 
with hysteresis to remove low-strength contour segments (Canny 1986). The basic idea of 
hysteresis is to set two different thresholds and allow a curve being tracked above the higher 
threshold to dip in strength down to the lower threshold. 
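
One simple reading of hysteresis on an already linked chain is sketched below: a run of consecutive edgels is kept if it never drops below the lower threshold and exceeds the higher threshold at least once. This is an illustrative interpretation, not Canny's original formulation, and the function name and argument names are assumptions.

```python
def hysteresis_segments(strengths, low, high):
    """Return index runs that stay >= low and reach >= high at least once."""
    kept, run, strong = [], [], False
    for i, s in enumerate(strengths):
        if s >= low:
            run.append(i)
            strong = strong or s >= high
        else:
            if run and strong:
                kept.append(run)   # close a run that contained a strong edgel
            run, strong = [], False
    if run and strong:
        kept.append(run)           # flush the final run
    return kept
```

For example, with `low=0.4` and `high=0.8`, a weak run such as strengths 0.5, 0.6 is discarded, while 0.5, 0.9, 0.6 is kept in full because it contains one strong edgel.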

Linked edgel lists can be encoded more compactly using a variety of alternative representations.
A chain code encodes a list of connected points lying on an $\mathcal{N}_8$ grid using a
three-bit code corresponding to the eight cardinal directions (N, NE, E, SE, S, SW, W, NW) 
between a point and its successor (Figure 4.34). While this representation is more compact 
than the original edgel list (especially if predictive variable-length coding is used), it is not 
very suitable for further processing. 
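
A chain code encoder and decoder fit in a few lines. The numbering of the eight directions is an assumption here (0 = E, increasing counterclockwise, with y pointing up); the text does not fix a convention, and other orderings work equally well as long as encoder and decoder agree.

```python
# Assumed convention: 0=E, 1=NE, 2=N, 3=NW, 4=W, 5=SW, 6=S, 7=SE (y axis pointing up).
DIRECTIONS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]


def chain_code(points):
    """Encode 8-connected integer grid points as a list of three-bit direction codes."""
    return [DIRECTIONS.index((x1 - x0, y1 - y0))
            for (x0, y0), (x1, y1) in zip(points, points[1:])]


def decode_chain(start, codes):
    """Rebuild the point list from a starting point and its direction codes."""
    points = [start]
    for code in codes:
        dx, dy = DIRECTIONS[code]
        x, y = points[-1]
        points.append((x + dx, y + dy))
    return points
```

Only the starting point needs full coordinates; every later point costs three bits before any predictive or run-length coding is applied.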

A more useful representation is the arc length parameterization of a contour, x(s), where 
s denotes the arc length along a curve. Consider the linked set of edgels shown in Fig- 
ure 4.35a. We start at one point (the dot at (1.0, 0.5) in Figure 4.35a) and plot it at coordinate 
s = 0 (Figure 4.35b). The next point at (2.0, 0.5) gets plotted at s = 1, and the next point 
at (2.5, 1.0) gets plotted at s = 1.7071, i.e., we increment s by the length of each edge seg- 
ment. The resulting plot can be resampled on a regular (say, integral) s grid before further 
processing. 
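
The worked example above (s = 0, 1, 1.7071) can be reproduced with a cumulative sum of segment lengths, followed by linear interpolation for the resampling step. The function names and the `step` parameter are illustrative assumptions.

```python
import numpy as np


def arc_length(points):
    """Cumulative arc length s at each vertex of a polyline."""
    points = np.asarray(points, dtype=float)
    segment_lengths = np.linalg.norm(np.diff(points, axis=0), axis=1)
    return np.concatenate([[0.0], np.cumsum(segment_lengths)])


def resample(points, step=1.0):
    """Resample the polyline at regular arc-length intervals (linear interpolation)."""
    points = np.asarray(points, dtype=float)
    s = arc_length(points)
    s_new = np.arange(0.0, s[-1] + 1e-12, step)
    x_new = np.interp(s_new, s, points[:, 0])
    y_new = np.interp(s_new, s, points[:, 1])
    return s_new, np.column_stack([x_new, y_new])
```

Calling `arc_length([(1.0, 0.5), (2.0, 0.5), (2.5, 1.0)])` gives the three arc-length values quoted in the text, since the second segment has length sqrt(0.5) ≈ 0.7071.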

Figure 4.35 Arc-length parameterization of a contour: (a) discrete points along the contour
are first transcribed as (b) (x, y) pairs along the arc length s. This curve can then be regularly
re-sampled or converted into alternative (e.g., Fourier) representations.

Figure 4.36 Matching two contours using their arc-length parameterization. If both curves
are normalized to unit length, $s \in [0, 1]$, and centered around their centroid $\mathbf{x}_0$, they will
have the same descriptor up to an overall “temporal” shift (due to different starting points for
s = 0) and a phase (x-y) shift (due to rotation).

Figure 4.37 Curve smoothing with a Gaussian kernel (Lowe 1988) © 1998 IEEE: (a) without
a shrinkage correction term; (b) with a shrinkage correction term.

Figure 4.38 Changing the character of a curve without affecting its sweep (Finkelstein and
Salesin 1994) © 1994 ACM: higher frequency wavelets can be replaced with exemplars from
a style library to effect different local appearances.

The advantage of the arc-length parameterization is that it makes matching and processing
(e.g., smoothing) operations much easier. Consider the two curves describing similar shapes
shown in Figure 4.36. To compare the curves, we first subtract the average values
$\mathbf{x}_0 = \int_s \mathbf{x}(s)$ from each descriptor. Next, we rescale each descriptor so that s goes from 0 to 1
instead of 0 to S, i.e., we divide x(s) by S. Finally, we take the Fourier transform of each
normalized descriptor, treating each x = (x, y) value as a complex number. If the original
curves are the same (up to an unknown scale and rotation), the resulting Fourier transforms
should differ only by a scale change in magnitude plus a constant complex phase shift, due
to rotation, and a linear phase shift in the domain, due to different starting points for s (see
Exercise 4.9).
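
The normalization and Fourier transform steps can be sketched as follows. The code assumes a closed contour whose samples are already equally spaced in arc length (for example, after resampling) and represented as complex numbers x + iy; the function name `fourier_descriptor` is hypothetical. Comparing only the magnitudes of the coefficients discards both the constant phase (rotation) and the linear phase (starting point).

```python
import numpy as np


def fourier_descriptor(z):
    """Centered, length-normalized Fourier coefficients of a closed contour.

    z: complex samples x + 1j*y, assumed equally spaced in arc length.
    """
    z = np.asarray(z, dtype=complex)
    total_length = np.sum(np.abs(np.diff(np.append(z, z[0]))))  # closed perimeter S
    z = (z - z.mean()) / total_length   # remove the centroid, normalize the scale
    return np.fft.fft(z) / len(z)


# A test shape and a rotated, scaled, translated copy with a different start point.
t = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
shape_a = (2.0 + np.cos(3.0 * t)) * np.exp(1j * t)
shape_b = 2.5 * np.exp(1j * 0.7) * np.roll(shape_a, 10) + (3.0 - 4.0j)

same_shape = np.allclose(np.abs(fourier_descriptor(shape_a)),
                         np.abs(fourier_descriptor(shape_b)))
```

If the phases are also needed (for instance, to recover the rotation angle), they can be read from the ratio of corresponding coefficients instead of being discarded.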

Arc-length parameterization can also be used to smooth curves in order to remove digiti- 
zation noise. However, if we just apply a regular smoothing filter, the curve tends to shrink 
on itself (Figure 4.37a). Lowe (1989) and Taubin (1995) describe techniques that compensate 
for this shrinkage by adding an offset term based on second derivative estimates or a larger 
smoothing kernel (Figure 4.37b). An alternative approach, based on selectively modifying 
different frequencies in a wavelet decomposition, is presented by Finkelstein and Salesin 
(1994). In addition to controlling shrinkage without affecting its “sweep”, wavelets allow the 
“character” of a curve to be interactively modified, as shown in Figure 4.38. 

The evolution of curves as they are smoothed and simplified is related to “grassfire” (dis- 
Figure 4.39 Image editing in the contour domain (Elder and Goldberg 2001) © 2001 IEEE:
(a) and (d) original images; (b) and (e) extracted edges (edges to be deleted are marked in 
white); (c) and (f) reconstructed edited images. 


tance) transforms and region skeletons (Section 3.3.3) (Tek and Kimia 2003), and can be used 
to recognize objects based on their contour shape (Sebastian and Kimia 2005). More local
descriptors of curve shape such as shape contexts (Belongie, Malik, and Puzicha 2002) can also
be used for recognition and are potentially more robust to missing parts due to occlusions. 

The field of contour detection and linking continues to evolve rapidly and now includes 
techniques for global contour grouping, boundary completion, and junction detection (Maire, 
Arbelaez, Fowlkes et al. 2008), as well as grouping contours into likely regions (Arbelaez,
Maire, Fowlkes et al. 2010) and wide-baseline correspondence (Meltzer and Soatto 2008). 


4.2.3 Application: Edge editing and enhancement

While edges can serve as components for object recognition or features for matching, they 
can also be used directly for image editing. 

In fact, if the edge magnitude and blur estimate are kept along with each edge, a visually 
similar image can be reconstructed from this information (Elder 1999). Based on this princi- 
ple, Elder and Goldberg (2001) propose a system for “image editing in the contour domain”. 
Their system allows users to selectively remove edges corresponding to unwanted features 
such as specularities, shadows, or distracting visual elements. After reconstructing the image 
from the remaining edges, the undesirable visual features have been removed (Figure 4.39). 



Figure 4.40 Approximating a curve (shown in black) as a polyline or B-spline: (a) original 
curve and a polyline approximation shown in red; (b) successive approximation by recursively 
finding points furthest away from the current approximation; (c) smooth interpolating spline, 
shown in dark blue, fit to the polyline vertices. 

Another potential application is to enhance perceptually salient edges while simplifying 
the underlying image to produce a cartoon-like or “pen-and-ink” stylized image (DeCarlo and 
Santella 2002). This application is discussed in more detail in Section 10.5.2. 


4.3 Lines


While edges and general curves are suitable for describing the contours of natural objects, 
the man-made world is full of straight lines. Detecting and matching these lines can be 
useful in a variety of applications, including architectural modeling, pose estimation in urban 
environments, and the analysis of printed document layouts. 

In this section, we present some techniques for extracting piecewise linear descriptions 
from the curves computed in the previous section. We begin with some algorithms for approx- 
imating a curve as a piecewise-linear polyline. We then describe the Hough transform, which 
can be used to group edgels into line segments even across gaps and occlusions. Finally, we 
describe how 3D lines with common vanishing points can be grouped together. These van- 
ishing points can be used to calibrate a camera and to determine its orientation relative to a 
rectahedral scene, as described in Section 6.3.2. 

4.3.1 Successive approximation 

As we saw in Section 4.2.2, describing a curve as a series of 2D locations $\mathbf{x}_i = \mathbf{x}(s_i)$ provides
a general representation suitable for matching and further processing. In many applications, 
however, it is preferable to approximate such a curve with a simpler representation, e.g., as a 
piecewise-linear polyline or as a B-spline curve (Farin 1996), as shown in Figure 4.40. 

Many techniques have been developed over the years to perform this approximation, 
which is also known as line simplification. One of the oldest, and simplest, is the one proposed 
Figure 4.41 Original Hough transform: (a) each point votes for a complete family of potential
lines $r_i(\theta) = x_i \cos\theta + y_i \sin\theta$; (b) each pencil of lines sweeps out a sinusoid in $(r, \theta)$;
their intersection provides the desired line equation.


by Ramer (1972) and Douglas and Peucker (1973), who recursively subdivide the curve at 
the point furthest away from the line joining the two endpoints (or the current coarse polyline 
approximation), as shown in Figure 4.40. Hershberger and Snoeyink (1992) provide a more 
efficient implementation and also cite some of the other related work in this area. 
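
The recursive subdivision can be sketched in a few lines of C++. This is a minimal version of the Ramer and Douglas-Peucker idea, assuming an open polyline and a user-chosen distance tolerance `tol`; the type `Point2` and the function names are hypothetical, and none of the efficiency improvements of Hershberger and Snoeyink are attempted.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point2 { double x, y; };

// Perpendicular distance from p to the line through a and b
// (falls back to the point distance when a and b coincide).
double distanceToLine(const Point2& p, const Point2& a, const Point2& b) {
    const double dx = b.x - a.x, dy = b.y - a.y;
    const double len = std::hypot(dx, dy);
    if (len == 0.0) return std::hypot(p.x - a.x, p.y - a.y);
    return std::fabs(dx * (p.y - a.y) - dy * (p.x - a.x)) / len;
}

// Keep the point furthest from the chord [first, last] whenever its
// distance exceeds the tolerance, then recurse on both halves.
void simplifyRange(const std::vector<Point2>& pts, std::size_t first,
                   std::size_t last, double tol, std::vector<bool>& keep) {
    double maxDist = 0.0;
    std::size_t split = first;
    for (std::size_t i = first + 1; i < last; ++i) {
        const double d = distanceToLine(pts[i], pts[first], pts[last]);
        if (d > maxDist) { maxDist = d; split = i; }
    }
    if (maxDist > tol) {
        keep[split] = true;
        simplifyRange(pts, first, split, tol, keep);
        simplifyRange(pts, split, last, tol, keep);
    }
}

std::vector<Point2> simplifyPolyline(const std::vector<Point2>& pts, double tol) {
    assert(tol >= 0.0);
    if (pts.size() < 3) return pts;
    std::vector<bool> keep(pts.size(), false);
    keep[0] = true;
    keep[pts.size() - 1] = true;
    simplifyRange(pts, 0, pts.size() - 1, tol, keep);
    std::vector<Point2> out;
    for (std::size_t i = 0; i < pts.size(); ++i)
        if (keep[i]) out.push_back(pts[i]);
    return out;
}
```

A larger `tol` yields a coarser polyline; with a tolerance larger than the curve's extent, only the two endpoints survive.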

Once the line simplification has been computed, it can be used to approximate the orig- 
inal curve. If a smoother representation or visualization is desired, either approximating or 
interpolating splines or curves can be used (Sections 3.5.1 and 5.1.1) (Szeliski and Ito 1986; 
Bartels, Beatty, and Barsky 1987; Farin 1996), as shown in Figure 4.40c. 


4.3.2 Hough transforms 

While curve approximation with polylines can often lead to successful line extraction, lines 
in the real world are sometimes broken up into disconnected components or made up of many 
collinear line segments. In many cases, it is desirable to group such collinear segments into 
extended lines. At a further processing stage (described in Section 4.3.3), we can then group 
such lines into collections with common vanishing points. 

The Hough transform, named after its original inventor (Hough 1962), is a well-known 
technique for having edges “vote” for plausible line locations (Duda and Hart 1972; Ballard 
1981; Illingworth and Kittler 1988). In its original formulation (Figure 4.41), each edge point 
votes for all possible lines passing through it, and lines corresponding to high accumulator or 
bin values are examined for potential line fits.⁹ Unless the points on a line are truly punctate,
a better approach (in my experience) is to use the local orientation information at each edgel 
to vote for a single accumulator cell (Figure 4.42), as described below. A hybrid strategy, 

⁹ The Hough transform can also be generalized to look for other geometric features such as circles (Ballard
1981), but we do not cover such extensions in this book. 
Figure 4.42 Oriented Hough transform: (a) an edgel re-parameterized in polar $(r, \theta)$ coordinates,
with $\hat{\mathbf{n}}_i = (\cos\theta_i, \sin\theta_i)$ and $r_i = \hat{\mathbf{n}}_i \cdot \mathbf{x}_i$; (b) $(r, \theta)$ accumulator array, showing the
votes for the three edgels marked in red, green, and blue.



Figure 4.43 2D line equation expressed in terms of the normal $\hat{\mathbf{n}}$ and distance to the origin
$d$.

where each edgel votes for a number of possible orientation or location pairs centered around 
the estimated orientation, may be desirable in some cases.

Before we can vote for line hypotheses, we must first choose a suitable representation. 
Figure 4.43 (copied from Figure 2.2a) shows the normal-distance $(\hat{\mathbf{n}}, d)$ parameterization for
a line. Since lines are made up of edge segments, we adopt the convention that the line normal
$\hat{\mathbf{n}}$ points in the same direction (i.e., has the same sign) as the image gradient $\mathbf{J}(\mathbf{x}) = \nabla I(\mathbf{x})$
(4.19). To obtain a minimal two-parameter representation for lines, we convert the normal
vector into an angle

$$\theta = \tan^{-1} n_y / n_x, \qquad (4.26)$$

as shown in Figure 4.43. The range of possible $(\theta, d)$ values is $[-180^\circ, 180^\circ] \times [-\sqrt{2}, \sqrt{2}]$,
assuming that we are using normalized pixel coordinates (2.61) that lie in $[-1, 1]$. The number
of bins to use along each axis depends on the accuracy of the position and orientation estimate 
available at each edgel and the expected line density, and is best set experimentally with some 
test runs on sample imagery. 

Given the line parameterization, the Hough transform proceeds as shown in Algorithm 4.2. 
procedure Hough($\{(x, y, \theta)\}$):

1. Clear the accumulator array.

2. For each detected edgel at location $(x, y)$ and orientation $\theta = \tan^{-1} n_y / n_x$,
compute the value of

$$d = x n_x + y n_y$$

and increment the accumulator corresponding to $(\theta, d)$.

3. Find the peaks in the accumulator corresponding to lines.

4. Optionally re-fit the lines to the constituent edgels.


Algorithm 4.2 Outline of a Hough transform algorithm based on oriented edge segments. 

Note that the original formulation of the Hough transform, which assumed no knowledge of
the edgel orientation $\theta$, has an additional loop inside Step 2 that iterates over all possible
values of $\theta$ and increments a whole series of accumulators.
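
The voting and peak-finding steps of Algorithm 4.2 might look as follows in C++. This is only a sketch under several assumptions: edgel coordinates are already normalized to $[-1, 1]$, each edgel carries a unit normal, the accumulator is a dense array with a caller-chosen number of bins, and peak finding is reduced to returning the single strongest cell. The names `OrientedEdgel`, `HoughAccumulator`, `houghVote`, and `strongestCell` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct OrientedEdgel { double x, y, nx, ny; };  // position and unit normal

struct HoughAccumulator {
    int thetaBins, dBins;
    std::vector<int> votes;  // row-major, thetaBins x dBins, initially zero
    HoughAccumulator(int tb, int db)
        : thetaBins(tb), dBins(db), votes(static_cast<std::size_t>(tb) * db, 0) {
        assert(tb > 0 && db > 0);
    }
    int& at(int t, int d) { return votes[static_cast<std::size_t>(t) * dBins + d]; }
};

// Map a value in [lo, hi] to a bin index in [0, bins - 1].
int toBin(double v, double lo, double hi, int bins) {
    int b = static_cast<int>((v - lo) / (hi - lo) * bins);
    if (b < 0) b = 0;
    if (b >= bins) b = bins - 1;
    return b;
}

// Steps 1 and 2: each edgel casts a single vote at (theta, d).
HoughAccumulator houghVote(const std::vector<OrientedEdgel>& edgels,
                           int thetaBins, int dBins) {
    const double pi = std::acos(-1.0);
    const double dMax = std::sqrt(2.0);
    HoughAccumulator acc(thetaBins, dBins);           // cleared accumulator
    for (const OrientedEdgel& e : edgels) {
        const double theta = std::atan2(e.ny, e.nx);  // (4.26), in radians
        const double d = e.x * e.nx + e.y * e.ny;
        acc.at(toBin(theta, -pi, pi, thetaBins), toBin(d, -dMax, dMax, dBins)) += 1;
    }
    return acc;
}

// Step 3, simplified: index of the single strongest accumulator cell.
std::size_t strongestCell(const HoughAccumulator& acc) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < acc.votes.size(); ++i)
        if (acc.votes[i] > acc.votes[best]) best = i;
    return best;
}
```

A practical implementation would typically smooth the accumulator, apply non-maximal suppression to extract several peaks, and handle the wrap-around of $\theta$ at $\pm 180^\circ$, none of which is shown here.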

There are a lot of details in getting the Hough transform to work well, but these are 
best worked out by writing an implementation and testing it out on sample data. Exercise 
4.12 describes some of these steps in more detail, including using edge segment lengths or 
strengths during the voting process, keeping a list of constituent edgels in the accumulator 
array for easier post-processing, and optionally combining edges of different “polarity” into 
the same line segments. 

An alternative to the 2D polar $(\theta, d)$ representation for lines is to use the full 3D $\mathbf{m} =
(\hat{\mathbf{n}}, d)$ line equation, projected onto the unit sphere. While the sphere can be parameterized
using spherical coordinates (2.8),

$$\hat{\mathbf{m}} = (\cos\theta \cos\phi, \sin\theta \cos\phi, \sin\phi), \qquad (4.27)$$

this does not uniformly sample the sphere and still requires the use of trigonometry. 

An alternative representation can be obtained by using a cube map, i.e., projecting $\mathbf{m}$ onto
the face of a unit cube (Figure 4.44a). To compute the cube map coordinate of a 3D vector
$\mathbf{m}$, first find the largest (absolute value) component of $\mathbf{m}$, i.e., $m = \pm\max(|n_x|, |n_y|, |d|)$,
and use this to select one of the six cube faces. Divide the remaining two coordinates by $m$
and use these as indices into the cube face. While this avoids the use of trigonometry, it does
require some decision logic. 
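
The decision logic is short. The following C++ sketch selects the face from the largest-magnitude component and divides the other two components by that (signed) component, as described above. The face numbering and the cyclic ordering of the two remaining coordinates are arbitrary assumptions made for illustration, as are the names `CubeMapCoord` and `toCubeMap`.

```cpp
#include <cassert>
#include <cmath>

struct CubeMapCoord {
    int face;     // 0..5 for +n_x, -n_x, +n_y, -n_y, +d, -d (assumed ordering)
    double u, v;  // remaining two components divided by the largest one, in [-1, 1]
};

CubeMapCoord toCubeMap(double nx, double ny, double d) {
    const double c[3] = {nx, ny, d};
    int axis = 0;
    for (int i = 1; i < 3; ++i)
        if (std::fabs(c[i]) > std::fabs(c[axis])) axis = i;
    const double m = c[axis];  // signed largest component
    assert(m != 0.0);          // the zero vector has no direction
    CubeMapCoord out;
    out.face = 2 * axis + (m < 0.0 ? 1 : 0);
    out.u = c[(axis + 1) % 3] / m;
    out.v = c[(axis + 2) % 3] / m;
    return out;
}
```

To index an accumulator, `u` and `v` would still need to be quantized, for example by mapping $[-1, 1]$ linearly onto the bin range of each face.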

One advantage of using the cube map, first pointed out by Tuytelaars, Van Gool, and 
Proesmans (1997), is that all of the lines passing through a point correspond to line segments 
Figure 4.44 Cube map representation for line equations and vanishing points: (a) a cube map
surrounding the unit sphere; (b) projecting the half-cube onto three subspaces (Tuytelaars, 
Van Gool, and Proesmans 1997) © 1997 IEEE. 


on the cube faces, which is useful if the original (full voting) variant of the Hough transform 
is being used. In their work, they represent the line equation as ax + b + y = 0, which 
does not treat the x and y axes symmetrically. Note that if we restrict d > 0 by ignoring the 
polarity of the edge orientation (gradient sign), we can use a half-cube instead, which can be 
represented using only three cube faces, as shown in Figure 4.44b (Tuytelaars, Van Gool, and 
Proesmans 1997). 

RANSAC-based line detection. Another alternative to the Hough transform is the RANdom
SAmple Consensus (RANSAC) algorithm described in more detail in Section 6.1.4. In
brief, RANSAC randomly chooses pairs of edgels to form a line hypothesis and then tests
how many other edgels fall onto this line. (If the edge orientations are accurate enough, a
single edgel can produce this hypothesis.) Lines with sufficiently large numbers of inliers
(matching edgels) are then selected as the desired line segments.

An advantage of RANSAC is that no accumulator array is needed and so the algorithm can 
be more space efficient and potentially less prone to the choice of bin size. The disadvantage 
is that many more hypotheses may need to be generated and tested than those obtained by 
finding peaks in the accumulator array. 
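
A minimal C++ sketch of the pair-sampling variant is given below. It assumes edgels are reduced to 2D points, uses a fixed number of trials and a fixed inlier distance `tol`, and returns only the single best hypothesis; the names `Pt`, `Line2D`, `RansacResult`, and `ransacLine` are hypothetical, and the adaptive trial counts discussed in Section 6.1.4 are not implemented.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

struct Pt { double x, y; };
struct Line2D { double nx, ny, d; };  // nx * x + ny * y = d, with unit normal
struct RansacResult { Line2D line; std::size_t inliers; };

// Line through two distinct points, in normal-distance form.
Line2D lineThrough(const Pt& a, const Pt& b) {
    const double dx = b.x - a.x, dy = b.y - a.y;
    const double len = std::hypot(dx, dy);
    assert(len > 0.0);
    const double nx = -dy / len, ny = dx / len;
    return {nx, ny, nx * a.x + ny * a.y};
}

std::size_t countInliers(const std::vector<Pt>& pts, const Line2D& l, double tol) {
    std::size_t n = 0;
    for (const Pt& p : pts)
        if (std::fabs(l.nx * p.x + l.ny * p.y - l.d) <= tol) ++n;
    return n;
}

// Try `trials` random pairs and keep the hypothesis with the most inliers.
RansacResult ransacLine(const std::vector<Pt>& pts, int trials, double tol,
                        unsigned seed) {
    assert(pts.size() >= 2 && trials > 0 && tol >= 0.0);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, pts.size() - 1);
    RansacResult best{{1.0, 0.0, 0.0}, 0};
    for (int t = 0; t < trials; ++t) {
        const Pt& a = pts[pick(rng)];
        const Pt& b = pts[pick(rng)];
        if (a.x == b.x && a.y == b.y) continue;  // degenerate pair, skip
        const Line2D hyp = lineThrough(a, b);
        const std::size_t n = countInliers(pts, hyp, tol);
        if (n > best.inliers) best = {hyp, n};
    }
    return best;
}
```

To extract several lines, one common approach is to remove the inliers of the best hypothesis and repeat until no hypothesis gathers enough support.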

In general, there is no clear consensus on which line estimation technique performs best. 
It is therefore a good idea to think carefully about the problem at hand and to implement 
several approaches (successive approximation, Hough, and RANSAC) to determine the one
that works best for your application. 

4.3.3 Vanishing points 

In many scenes, structurally important lines have the same vanishing point because they are 
parallel in 3D. Examples of such lines are horizontal and vertical building edges, zebra cross- 
ings, railway tracks, the edges of furniture such as tables and dressers, and of course, the 
ubiquitous calibration pattern (Figure 4.45). Finding the vanishing points common to such 
Figure 4.45 Real-world vanishing points: (a) architecture (Sinha, Steedly, Szeliski et al.
2008), (b) furniture (Micusik, Wildenauer, and Kosecka 2008) © 2008 IEEE, and (c) cali- 
bration patterns (Zhang 2000). 


line sets can help refine their position in the image and, in certain cases, help determine the 
intrinsic and extrinsic orientation of the camera (Section 6.3.2). 

Over the years, a large number of techniques have been developed for finding vanishing 
points, including (Quan and Mohr 1989; Collins and Weiss 1990; Brillaut-O’Mahoney 1991;
McLean and Kotturi 1995; Becker and Bove 1995; Shufelt 1999; Tuytelaars, Van Gool, and 
Proesmans 1997; Schaffalitzky and Zisserman 2000; Antone and Teller 2002; Rother 2002; 
Kosecka and Zhang 2005; Pflugfelder 2008; Tardif 2009); the more recent papers provide
additional references. In this section, we present a simple Hough technique based
on having line pairs vote for potential vanishing point locations, followed by a robust least
squares fitting stage. For alternative approaches, please see some of the more recent papers
listed above.

The first stage in my vanishing point detection algorithm uses a Hough transform to accu- 
mulate votes for likely vanishing point candidates. As with line fitting, one possible approach 
is to have each line vote for all possible vanishing point directions, either using a cube map 
(Tuytelaars, Van Gool, and Proesmans 1997; Antone and Teller 2002) or a Gaussian sphere 
(Collins and Weiss 1990), optionally using knowledge about the uncertainty in the vanishing
point location to perform a weighted vote (Collins and Weiss 1990; Brillaut-O’Mahoney
1991; Shufelt 1999). My preferred approach is to use pairs of detected line segments to form
candidate vanishing point locations. Let $\hat{\mathbf{m}}_i$ and $\hat{\mathbf{m}}_j$ be the (unit norm) line equations for a
pair of line segments and $l_i$ and $l_j$ be their corresponding segment lengths. The location of
the corresponding vanishing point hypothesis can be computed as

$$\mathbf{v}_{ij} = \hat{\mathbf{m}}_i \times \hat{\mathbf{m}}_j \qquad (4.28)$$

and the corresponding weight set to

$$w_{ij} = \|\mathbf{v}_{ij}\| \, l_i l_j. \qquad (4.29)$$
Figure 4.46 Triple product of the line segment endpoints $\mathbf{p}_{i0}$ and $\mathbf{p}_{i1}$ and the vanishing
point $\mathbf{v}$. The area $A$ is proportional to the perpendicular distance $d_1$ and the distance between
the other endpoint $\mathbf{p}_{i0}$ and the vanishing point.

This has the desirable effect of downweighting (near-)collinear line segments and short line 
segments. The Hough space itself can either be represented using spherical coordinates (4.27) 
or as a cube map (Figure 4.44a). 
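
The pairwise hypothesis and its weight, equations (4.28) and (4.29), can be sketched as follows. The sketch assumes each segment is given by its two endpoints in image coordinates and that the unit-norm line equation is obtained from the cross product of the homogeneous endpoints; the names `Segment`, `unitLine`, and `pairVote` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<double, 3>;

Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]};
}

double length3(const Vec3& a) {
    return std::sqrt(a[0] * a[0] + a[1] * a[1] + a[2] * a[2]);
}

struct Segment { double x0, y0, x1, y1; };

// Unit-norm homogeneous line equation through the two endpoints.
Vec3 unitLine(const Segment& s) {
    const Vec3 m = cross({s.x0, s.y0, 1.0}, {s.x1, s.y1, 1.0});
    const double n = length3(m);
    assert(n > 0.0);  // endpoints must be distinct
    return {m[0] / n, m[1] / n, m[2] / n};
}

struct VanishingVote { Vec3 v; double weight; };

// v_ij = m_i x m_j (4.28) and w_ij = ||v_ij|| l_i l_j (4.29).
VanishingVote pairVote(const Segment& a, const Segment& b) {
    const Vec3 v = cross(unitLine(a), unitLine(b));
    const double la = std::hypot(a.x1 - a.x0, a.y1 - a.y0);
    const double lb = std::hypot(b.x1 - b.x0, b.y1 - b.y0);
    return {v, length3(v) * la * lb};
}
```

The returned `v` is a homogeneous point; it can be normalized to unit length and passed to a spherical or cube-map accumulator, with `weight` added to the selected cell.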

Once the Hough accumulator space has been populated, peaks can be detected in a manner 
similar to that previously discussed for line detection. Given a set of candidate line segments 
that voted for a vanishing point, which can optionally be kept as a list at each Hough accu- 
mulator cell, I then use a robust least squares fit to estimate a more accurate location for each 
vanishing point. 

Consider the relationship between the two line segment endpoints $\{\mathbf{p}_{i0}, \mathbf{p}_{i1}\}$ and the vanishing
point $\mathbf{v}$, as shown in Figure 4.46. The area $A$ of the triangle given by these three points,
which is the magnitude of their triple product

$$A_i = |(\mathbf{p}_{i0} \times \mathbf{p}_{i1}) \cdot \mathbf{v}|, \qquad (4.30)$$

is proportional to the perpendicular distance $d_1$ between each endpoint and the line through
$\mathbf{v}$ and the other endpoint, as well as the distance between $\mathbf{p}_{i0}$ and $\mathbf{v}$. Assuming that the
accuracy of a fitted line segment is proportional to its endpoint accuracy (Exercise 4.13), this
therefore serves as an optimal metric for how well a vanishing point fits a set of extracted
lines (Leibowitz (2001, Section 3.6.1) and Pflugfelder (2008, Section 2.1.1.3)). A robustified
least squares estimate (Appendix B.3) for the vanishing point can therefore be written as

$$E = \sum_i \rho(A_i) = \mathbf{v}^T \left( \sum_i w_i(A_i) \, \mathbf{m}_i \mathbf{m}_i^T \right) \mathbf{v} = \mathbf{v}^T \mathbf{M} \mathbf{v}, \qquad (4.31)$$

where $\mathbf{m}_i = \mathbf{p}_{i0} \times \mathbf{p}_{i1}$ is the segment line equation weighted by its length and $w_i =
\rho'(A_i)/A_i$ is the influence of each robustified (reweighted) measurement on the final error
(Appendix B.3). Notice how this metric is closely related to the original formula for the pairwise
weighted Hough transform accumulation step. The final desired value for $\mathbf{v}$ is computed
as the least eigenvector of $\mathbf{M}$.

While the technique described above proceeds in two discrete stages, better results may
be obtained by alternating between assigning lines to vanishing points and refitting the van- 
ishing point locations (Antone and Teller 2002; Kosecka and Zhang 2005; Pflugfelder 2008). 
The results of detecting individual vanishing points can also be made more robust by simulta- 
neously searching for pairs or triplets of mutually orthogonal vanishing points (Shufelt 1999; 
Antone and Teller 2002; Rother 2002; Sinha, Steedly, Szeliski et al. 2008). Some results of 
such vanishing point detection algorithms can be seen in Figure 4.45. 

4.3.4 Application: Rectangle detection

Once sets of mutually orthogonal vanishing points have been detected, it now becomes pos- 
sible to search for 3D rectangular structures in the image (Figure 4.47). Over the last decade, 
a variety of techniques have been developed to find such rectangles, primarily focused on 
architectural scenes (Kosecka and Zhang 2005; Han and Zhu 2005; Shaw and Barnes 2006; 
Micusik, Wildenauer, and Kosecka 2008; Schindler, Krishnamurthy, Lublinerman et al. 2008). 

After detecting orthogonal vanishing directions, Kosecka and Zhang (2005) refine the 
fitted line equations, search for corners near line intersections, and then verify rectangle hy- 
potheses by rectifying the corresponding patches and looking for a preponderance of hori- 
zontal and vertical edges (Figures 4.47a-b). In follow-on work, Micusik, Wildenauer, and 
Kosecka (2008) use a Markov random field (MRF) to disambiguate between potentially over- 
lapping rectangle hypotheses. They also use a plane sweep algorithm to match rectangles 
between different views (Figures 4.47d-f). 

A different approach is proposed by Han and Zhu (2005), who use a grammar of potential 
rectangle shapes and nesting structures (between rectangles and vanishing points) to infer the 
most likely assignment of line segments to rectangles (Figure 4.47c). 

4.4 Additional reading 

One of the seminal papers on feature detection, description, and matching is by Lowe (2004). 
Comprehensive surveys and evaluations of such techniques have been made by Schmid, 
Mohr, and Bauckhage (2000); Mikolajczyk and Schmid (2005); Mikolajczyk, Tuytelaars, 
Schmid et al. (2005); Tuytelaars and Mikolajczyk (2007) while Shi and Tomasi (1994) and 
Triggs (2004) also provide nice reviews. 

In the area of feature detectors (Mikolajczyk, Tuytelaars, Schmid et al. 2005), in addition 
to such classic approaches as Förstner–Harris (Förstner 1986; Harris and Stephens 1988) and
difference of Gaussians (Lindeberg 1993, 1998b; Lowe 2004), maximally stable extremal re- 
gions (MSERs) are widely used for applications that require affine invariance (Matas, Chum, 
Figure 4.47 Rectangle detection: (a) indoor corridor and (b) building exterior with grouped
facades (Kosecka and Zhang 2005) © 2005 Elsevier; (c) grammar-based recognition (Han 
and Zhu 2005) © 2005 IEEE; (d-f) rectangle matching using a plane sweep algorithm 
(Micusik, Wildenauer, and Kosecka 2008) © 2008 IEEE. 


Urban et al. 2004; Nister and Stewenius 2008). More recent interest point detectors are 
discussed by Xiao and Shah (2003); Koethe (2003); Carneiro and Jepson (2005); Kenney, 
Zuliani, and Manjunath (2005); Bay, Tuytelaars, and Van Gool (2006); Platel, Balmachnova, 
Florack et al. (2006); Rosten and Drummond (2006), as well as techniques based on line
matching (Zoghlami, Faugeras, and Deriche 1997; Bartoli, Coquerelle, and Sturm 2004) and
region detection (Kadir, Zisserman, and Brady 2004; Matas, Chum, Urban et al. 2004;
Tuytelaars and Van Gool 2004; Corso and Hager 2005).

A variety of local feature descriptors (and matching heuristics) are surveyed and com- 
pared by Mikolajczyk and Schmid (2005). More recent publications in this area include 
those by van de Weijer and Schmid (2006); Abdel-Hakim and Farag (2006); Winder and 
Brown (2007); Hua, Brown, and Winder (2007). Techniques for efficiently matching features 
include k-d trees (Beis and Lowe 1999; Lowe 2004; Muja and Lowe 2009), pyramid match- 
ing kernels (Grauman and Darrell 2005), metric (vocabulary) trees (Nister and Stewenius 
2006), and a variety of multi-dimensional hashing techniques (Shakhnarovich, Viola, and 
Darrell 2003; Torralba, Weiss, and Fergus 2008; Weiss, Torralba, and Fergus 2008; Kulis and 
Grauman 2009; Raginsky and Lazebnik 2009).

The classic reference on feature detection and tracking is (Shi and Tomasi 1994). More 
recent work in this field has focused on learning better matching functions for specific features 
(Avidan 2001; Jurie and Dhome 2002; Williams, Blake, and Cipolla 2003; Lepetit and Fua
2005; Lepetit, Pilet, and Fua 2006; Hinterstoisser, Benhimane, Navab et al. 2008; Rogez,
Rihan, Ramalingam et al. 2008; Ozuysal, Calonder, Lepetit et al. 2010).

A highly cited and widely used edge detector is the one developed by Canny (1986). 
Alternative edge detectors as well as experimental comparisons can be found in publica- 
tions by Nalwa and Binford (1986); Nalwa (1987); Deriche (1987); Freeman and Adelson 
(1991); Nalwa (1993); Heath, Sarkar, Sanocki et al. (1998); Crane (1997); Ritter and Wilson 
(2000); Bowyer, Kranenburg, and Dougherty (2001); Arbelaez, Maire, Fowlkes et al. (2010). 
The topic of scale selection in edge detection is nicely treated by Elder and Zucker (1998), 
while approaches to color and texture edge detection can be found in (Ruzon and Tomasi 
2001; Martin, Fowlkes, and Malik 2004; Gevers, van de Weijer, and Stokman 2006). Edge 
detectors have also recently been combined with region segmentation techniques to further 
improve the detection of semantically salient boundaries (Maire, Arbelaez, Fowlkes et al. 
2008; Arbelaez, Maire, Fowlkes et al. 2010). Edges linked into contours can be smoothed 
and manipulated for artistic effect (Lowe 1989; Finkelstein and Salesin 1994; Taubin 1995) 
and used for recognition (Belongie, Malik, and Puzicha 2002; Tek and Kimia 2003; Sebastian 
and Kimia 2005). 

An early, well-regarded paper on straight line extraction in images was written by Burns, 
Hanson, and Riseman (1986). More recent techniques often combine line detection with
vanishing point detection (Quan and Mohr 1989; Collins and Weiss 1990; Brillaut-O’Mahoney
1991; McLean and Kotturi 1995; Becker and Bove 1995; Shufelt 1999; Tuytelaars, Van Gool, 
and Proesmans 1997; Schaffalitzky and Zisserman 2000; Antone and Teller 2002; Rother 
2002; Kosecka and Zhang 2005; Pflugfelder 2008; Sinha, Steedly, Szeliski et al. 2008; Tardif 
2009). 

4.5 Exercises 

Ex 4.1: Interest point detector Implement one or more keypoint detectors and compare 
their performance (with your own or with a classmate’s detector). 

Possible detectors: 

• Laplacian or Difference of Gaussian; 

• Förstner–Harris Hessian (try different formula variants given in (4.9–4.11));

• oriented/steerable filter, looking for either a high second-order response or two
edges in a window (Koethe 2003), as discussed in Section 4.1.1.

Other detectors are described by Mikolajczyk, Tuytelaars, Schmid et al. (2005); Tuytelaars
and Mikolajczyk (2007). Additional optional steps could include: 

1. Compute the detections on a sub-octave pyramid and find 3D maxima.

2. Find local orientation estimates using steerable filter responses or a gradient histogram- 
ming method. 

3. Implement non-maximal suppression, such as the adaptive technique of Brown, Szeliski, 
and Winder (2005). 

4. Vary the window shape and size (pre-filter and aggregation). 

To test for repeatability, download the code from http://www.robots.ox.ac.uk/~vgg/research/ 
affine/ (Mikolajczyk, Tuytelaars, Schmid et al. 2005; Tuytelaars and Mikolajczyk 2007) or 
simply rotate or shear your own test images. (Pick a domain you may want to use later, e.g., 
for outdoor stitching.) 

Be sure to measure and report the stability of your scale and orientation estimates. 

Ex 4.2: Interest point descriptor Implement one or more descriptors (steered to local scale 
and orientation) and compare their performance (with your own or with a classmate’s detec- 
tor). 

Some possible descriptors include 

• contrast-normalized patches (Brown, Szeliski, and Winder 2005); 

• SIFT (Lowe 2004); 

• GLOH (Mikolajczyk and Schmid 2005); 

• DAISY (Winder and Brown 2007; Tola, Lepetit, and Fua 2010). 

Other descriptors are described by Mikolajczyk and Schmid (2005).

Ex 4.3: ROC curve computation Given a pair of curves (histograms) plotting the number 
of matching and non-matching features as a function of Euclidean distance d as shown in 
Figure 4.23b, derive an algorithm for plotting a ROC curve (Figure 4.23a). In particular, let 
t(d) be the distribution of true matches and f(d) be the distribution of (false) non-matches. 
Write down the equations for the ROC, i.e., TPR(FPR), and the AUC. 

(Hint: Plot the cumulative distributions $T(d) = \int t(d)$ and $F(d) = \int f(d)$ and see if
these help you derive the TPR and FPR at a given threshold $\theta$.)
Ex 4.4: Feature matcher After extracting features from a collection of overlapping or
distorted images,¹⁰ match them up by their descriptors either using nearest neighbor matching
or a more efficient matching strategy such as a k-d tree. 

See whether you can improve the accuracy of your matches using techniques such as the 
nearest neighbor distance ratio. 

Ex 4.5: Feature tracker Instead of finding feature points independently in multiple images 
and then matching them, find features in the first image of a video or image sequence and 
then re-locate the corresponding points in the next frames using either search and gradient 
descent (Shi and Tomasi 1994) or learned feature detectors (Lepetit, Pilet, and Fua 2006; 
Fossati, Dimitrijevic, Lepetit et al. 2007). When the number of tracked points drops below a
threshold or new regions in the image become visible, find additional points to track. 

(Optional) Winnow out incorrect matches by estimating a homography (6.19-6.23) or 
fundamental matrix (Section 7.2.1). 

(Optional) Refine the accuracy of your matches using the iterative registration algorithm 
described in Section 8.2 and Exercise 8.2. 

Ex 4.6: Facial feature tracker Apply your feature tracker to tracking points on a person’s 
face, either manually initialized to interesting locations such as eye corners or automatically 
initialized at interest points. 

(Optional) Match features between two people and use these features to perform image 
morphing (Exercise 3.25). 

Ex 4.7: Edge detector Implement an edge detector of your choice. Compare its perfor- 
mance to that of your classmates’ detectors or code downloaded from the Internet. 

A simple but well-performing sub-pixel edge detector can be created as follows: 

1. Blur the input image a little,

$$B_\sigma(\mathbf{x}) = G_\sigma(\mathbf{x}) * I(\mathbf{x}).$$

2. Construct a Gaussian pyramid (Exercise 3.19),

$$P = \text{Pyramid}\{B_\sigma(\mathbf{x})\}.$$

3. Subtract an interpolated coarser-level pyramid image from the original resolution blurred
image,

$$S(\mathbf{x}) = B_\sigma(\mathbf{x}) - P.\text{InterpolatedLevel}(L).$$

¹⁰ http://www.robots.ox.ac.uk/~vgg/research/affine/.
    struct SEdgel {
        float e[2][2];      // edgel endpoints (zero crossing)
        float x, y;         // sub-pixel edge position (midpoint)
        float n_x, n_y;     // orientation, as normal vector
        float theta;        // orientation, as angle (degrees)
        float length;       // length of edgel
        float strength;     // strength of edgel (gradient magnitude)
    };

    struct SLine : public SEdgel {
        float line_length;  // length of line (est. from ellipsoid)
        float sigma;        // estimated std. dev. of edgel noise
        float r;            // line equation: x * n_y - y * n_x = r
    };

Figure 4.48 A potential C++ structure for edgel and line elements.

4. For each quad of pixels, $\{(i, j), (i + 1, j), (i, j + 1), (i + 1, j + 1)\}$, count the number
of zero crossings along the four edges.

5. When there are exactly two zero crossings, compute their locations using (4.25) and
store these edgel endpoints along with the midpoint in the edgel structure (Figure 4.48).

6. For each edgel, compute the local gradient by taking the horizontal and vertical differences
between the values of $S$ along the zero crossing edges.

7. Store the magnitude of this gradient as the edge strength and either its orientation or
that of the segment joining the edgel endpoints as the edge orientation.

8. Add the edgel to a list of edgels or store it in a 2D array of edgels (addressed by pixel
coordinates).

Figure 4.48 shows a possible representation for each computed edgel.

Ex 4.8: Edge linking and thresholding Link up the edges computed in the previous exercise
into chains and optionally perform thresholding with hysteresis.

The steps may include:

1. Store the edgels either in a 2D array (say, an integer image with indices into the edgel
list) or pre-sort the edgel list first by (integer) x coordinates and then y coordinates, for
faster neighbor finding.

2. Pick up an edgel from the list of unlinked edgels and find its neighbors in both directions
until no neighbor is found or a closed contour is obtained. Flag edgels as linked
as you visit them and push them onto your list of linked edgels. 

3. Alternatively, generalize a previously developed connected component algorithm (Ex- 
ercise 3.14) to perform the linking in just two raster passes. 

4. (Optional) Perform hysteresis-based thresholding (Canny 1986). Use two thresholds 
“hi” and “lo” for the edge strength. A candidate edgel is considered an edge if either
its strength is above the “hi” threshold or its strength is above the “lo” threshold and it
is (recursively) connected to a previously detected edge.

5. (Optional) Link together contours that have small gaps but whose endpoints have sim- 
ilar orientations. 

6. (Optional) Find junctions between adjacent contours, e.g., using some of the ideas (or 
references) from Maire, Arbelaez, Fowlkes et al. (2008). 
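
The hysteresis step (item 4) can be written as a breadth-first propagation from the strong edgels. The C++ sketch below assumes that linking has already produced, for each edgel index, a list of neighboring edgel indices; this adjacency-list representation and the name `hysteresis` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// strength[i]: edge strength of edgel i; neighbors[i]: indices of edgels linked to i.
// Returns a flag per edgel that is true if the edgel survives hysteresis thresholding.
std::vector<bool> hysteresis(const std::vector<double>& strength,
                             const std::vector<std::vector<std::size_t>>& neighbors,
                             double lo, double hi) {
    assert(strength.size() == neighbors.size() && lo <= hi);
    std::vector<bool> isEdge(strength.size(), false);
    std::queue<std::size_t> frontier;

    // Seed with every edgel above the "hi" threshold.
    for (std::size_t i = 0; i < strength.size(); ++i)
        if (strength[i] > hi) { isEdge[i] = true; frontier.push(i); }

    // Grow along links through edgels above the "lo" threshold.
    while (!frontier.empty()) {
        const std::size_t i = frontier.front();
        frontier.pop();
        for (std::size_t j : neighbors[i])
            if (!isEdge[j] && strength[j] > lo) { isEdge[j] = true; frontier.push(j); }
    }
    return isEdge;
}
```

Using an explicit queue instead of recursion avoids deep call stacks on long contours.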

Ex 4.9: Contour matching Convert a closed contour (linked edgel list) into its arc-length 
parameterization and use this to match object outlines. 

The steps may include: 

1. Walk along the contour and create a list of $(x_i, y_i, s_i)$ triplets, using the arc-length
formula

$$s_{i+1} = s_i + \|\mathbf{x}_{i+1} - \mathbf{x}_i\|. \tag{4.32}$$

2. Resample this list onto a regular set of $(x_j, y_j, j)$ samples using linear interpolation of
each segment.

3. Compute the average values of x and y, i.e., $\bar{x}$ and $\bar{y}$, and subtract them from your
sampled curve points.

4. Resample the original $(x_i, y_i, s_i)$ piecewise-linear function onto a length-independent
set of samples, say $j \in [0, 1023]$. (Using a length which is a power of two makes
subsequent Fourier transforms more convenient.)

5. Compute the Fourier transform of the curve, treating each (x,y) pair as a complex 
number. 

6. To compare two curves, fit a linear equation to the phase difference between the two 
curves. (Careful: phase wraps around at 360°. Also, you may wish to weight samples 
by their Fourier spectrum magnitude — see Section 8.1.2.) 
7. (Optional) Prove that the linear phase component corresponds to a shift in s (the choice
of starting point along the contour), while the constant component corresponds to rotation.

Of course, feel free to try any other curve descriptor and matching technique from the com- 
puter vision literature (Tek and Kimia 2003; Sebastian and Kimia 2005). 
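Steps 1 to 4 of the contour matching exercise can be sketched as follows. This is an illustrative sketch only: the function name `arc_length_resample`, the choice of complex numbers for the output, and the assumption that the input is a closed polygon given as a list of (x, y) tuples are not from the text.

```python
import math

def arc_length_resample(points, n_samples):
    """Resample a closed polyline at n_samples equal arc-length steps.

    Returns centroid-subtracted samples as complex numbers x + iy, ready
    to be passed to a Fourier transform.
    """
    closed = points + [points[0]]            # repeat first vertex to close the loop
    s = [0.0]                                # cumulative arc length, as in (4.32)
    for (x0, y0), (x1, y1) in zip(closed, closed[1:]):
        s.append(s[-1] + math.hypot(x1 - x0, y1 - y0))
    total = s[-1]
    out, seg = [], 0
    for j in range(n_samples):
        target = total * j / n_samples
        while s[seg + 1] < target:           # advance to the segment containing target
            seg += 1
        span = s[seg + 1] - s[seg]
        t = (target - s[seg]) / span if span > 0 else 0.0
        (x0, y0), (x1, y1) = closed[seg], closed[seg + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    # Subtract the centroid so that the descriptor is translation invariant.
    mx = sum(x for x, _ in out) / n_samples
    my = sum(y for _, y in out) / n_samples
    return [complex(x - mx, y - my) for x, y in out]
```

Dividing the total length into a fixed number of steps merges the regular resampling and the length-independent resampling into a single pass, which is a simplification of the exercise's separate steps 2 and 4.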

Ex 4.10: Jigsaw puzzle solver — challenging Write a program to automatically solve a jig- 
saw puzzle from a set of scanned puzzle pieces. Your software may include the following 
components: 

1. Scan the pieces (either face up or face down) on a flatbed scanner with a distinctively 
colored background. 

2. (Optional) Scan in the box top to use as a low-resolution reference image. 

3. Use color-based thresholding to isolate the pieces. 

4. Extract the contour of each piece using edge finding and linking. 

5. (Optional) Re-represent each contour using an arc-length or some other re-parameterization. 
Break up the contours into meaningful matchable pieces. (Is this hard?) 

6. (Optional) Associate color values with each contour to help in the matching. 

7. (Optional) Match pieces to the reference image using some rotationally invariant fea- 
ture descriptors. 

8. Solve a global optimization or (backtracking) search problem to snap pieces together 
and place them in the correct location relative to the reference image. 

9. Test your algorithm on a succession of more difficult puzzles and compare your results 
with those of others. 

Ex 4.11: Successive approximation line detector Implement a line simplification algorithm 
(Section 4.3.1) (Ramer 1972; Douglas and Peucker 1973) to convert a hand-drawn curve (or 
linked edge image) into a small set of polylines. 

(Optional) Re-render this curve using either an approximating or interpolating spline or 
Bezier curve (Szeliski and Ito 1986; Bartels, Beatty, and Barsky 1987; Farin 1996). 
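A minimal sketch of the successive approximation idea is given below, assuming the curve is an open list of (x, y) tuples and that `tol` is a user-chosen maximum perpendicular deviation in pixels. The function name `simplify` is hypothetical.

```python
import math

def simplify(points, tol):
    """Recursive successive-approximation (Ramer / Douglas-Peucker) simplification."""
    if len(points) < 3:
        return list(points)
    (x0, y0), (x1, y1) = points[0], points[-1]
    dx, dy = x1 - x0, y1 - y0
    length = math.hypot(dx, dy)

    def dist(p):
        # Perpendicular distance to the chord, or distance to the first
        # endpoint when the chord is degenerate.
        if length == 0.0:
            return math.hypot(p[0] - x0, p[1] - y0)
        return abs(dx * (p[1] - y0) - dy * (p[0] - x0)) / length

    # Find the interior point that deviates most from the chord.
    worst = max(range(1, len(points) - 1), key=lambda i: dist(points[i]))
    if dist(points[worst]) <= tol:
        return [points[0], points[-1]]
    # Split at the worst point and simplify both halves.
    left = simplify(points[:worst + 1], tol)
    right = simplify(points[worst:], tol)
    return left[:-1] + right
```

For very long curves the recursion depth can become large; an explicit stack would be a reasonable alternative in that case.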

Ex 4.12: Hough transform line detector Implement a Hough transform for finding lines 
in images: 




1. Create an accumulator array of the appropriate user-specified size and clear it. The user
can specify the spacing in degrees between orientation bins and in pixels between dis- 
tance bins. The array can be allocated as integer (for simple counts), floating point (for 
weighted counts), or as an array of vectors for keeping back pointers to the constituent 
edges. 

2. For each detected edgel at location (x, y) and orientation $\theta = \tan^{-1} n_y/n_x$, compute
the value of

$$d = x n_x + y n_y \tag{4.33}$$

and increment the accumulator corresponding to $(\theta, d)$.

(Optional) Weight the vote of each edge by its length (see Exercise 4.7) or the strength 
of its gradient. 

3. (Optional) Smooth the scalar accumulator array by adding in values from its immediate 
neighbors. This can help counteract the discretization effect of voting for only a single 
bin — see Exercise 3.7. 

4. Find the largest peaks (local maxima) in the accumulator corresponding to lines. 

5. (Optional) For each peak, re-fit the lines to the constituent edgels, using total least 
squares (Appendix A.2). Use the original edgel lengths or strength weights to weight
the least squares fit, as well as the agreement between the hypothesized line orienta- 
tion and the edgel orientation. Determine whether these heuristics help increase the 
accuracy of the fit. 

6. After fitting each peak, zero-out or eliminate that peak and its adjacent bins in the array, 
and move on to the next largest peak. 

Test out your Hough transform on a variety of images taken indoors and outdoors, as well 
as checkerboard calibration patterns. 

For checkerboard patterns, you can modify your Hough transform by collapsing antipodal
bins $(\theta \pm 180^\circ, -d)$ with $(\theta, d)$ to find lines that do not care about polarity changes. Can you
think of examples in real-world images where this might be desirable as well?
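The voting stage (steps 1 and 2) might look like the sketch below. It is only an illustration under several assumptions: edgels are supplied as (x, y, n_x, n_y, weight) tuples with a unit normal, the accumulator is a sparse dictionary rather than a dense array, and the names `hough_votes`, `theta_step_deg`, `d_step`, and `d_max` are invented for the example.

```python
import math

def hough_votes(edgels, theta_step_deg, d_step, d_max):
    """Accumulate votes for lines d = x*n_x + y*n_y from oriented edgels.

    edgels: iterable of (x, y, n_x, n_y, weight) with (n_x, n_y) a unit normal.
    Returns a dict mapping (theta_bin, d_bin) to the accumulated weight.
    """
    n_theta = int(round(360.0 / theta_step_deg))
    n_d = int(math.ceil(2.0 * d_max / d_step))
    acc = {}
    for x, y, nx, ny, w in edgels:
        theta = math.degrees(math.atan2(ny, nx)) % 360.0
        d = x * nx + y * ny                       # equation (4.33)
        t_bin = int(theta / theta_step_deg) % n_theta
        # Shift d by d_max so that negative distances map to valid bins.
        d_bin = min(n_d - 1, max(0, int((d + d_max) / d_step)))
        acc[(t_bin, d_bin)] = acc.get((t_bin, d_bin), 0.0) + w
    return acc
```

The strongest line is then `max(acc, key=acc.get)`. Smoothing, peak suppression, and re-fitting (steps 3 to 6) are left out of this sketch.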

Ex 4.13: Line fitting uncertainty Estimate the uncertainty (covariance) in your line fit us- 
ing uncertainty analysis. 

1. After determining which edgels belong to the line segment (using either successive 
approximation or Hough transform), re-fit the line segment using total least squares 
(Van Huffel and Vandewalle 1991; Van Huffel and Lemmerling 2002), i.e., find the 
mean or centroid of the edgels and then use eigenvalue analysis to find the dominant 
orientation. 
2. Compute the perpendicular errors (deviations) to the line and robustly estimate the
variance of the fitting noise using an estimator such as MAD (Appendix B.3).

3. (Optional) re-fit the line parameters by throwing away outliers or using a robust norm 
or influence function. 

4. Estimate the error in the perpendicular location of the line segment and its orientation. 
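Steps 1 and 2 can be sketched in a few lines for the 2D case, where the dominant eigenvector of the scatter matrix has a closed form. The function name `fit_line_tls` is hypothetical, and the factor 1.4826 (which converts a median absolute deviation into a standard deviation estimate under Gaussian noise) is a standard convention assumed here rather than taken from the text.

```python
import math
from statistics import median

def fit_line_tls(points):
    """Total least squares line fit; returns (centroid, unit normal, sigma)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Entries of the 2x2 scatter matrix about the centroid.
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    # Orientation of the dominant eigenvector (closed form for a 2x2 matrix).
    phi = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    normal = (-math.sin(phi), math.cos(phi))
    # Signed perpendicular deviations from the fitted line.
    errs = [(x - mx) * normal[0] + (y - my) * normal[1] for x, y in points]
    # Robust scale estimate: 1.4826 * median absolute deviation.
    med = median(errs)
    mad = median(abs(e - med) for e in errs)
    return (mx, my), normal, 1.4826 * mad
```

The returned sigma is a starting point for step 4: the uncertainty in the line's perpendicular position and orientation can be propagated from it and from the spread of the edgels along the line.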

Ex 4.14: Vanishing points Compute the vanishing points in an image using one of the tech- 
niques described in Section 4.3.3 and optionally refine the original line equations associated 
with each vanishing point. Your results can be used later to track a target (Exercise 6.5) or 
reconstruct architecture (Section 12.6.1). 

Ex 4.15: Vanishing point uncertainty Perform an uncertainty analysis on your estimated 
vanishing points. You will need to decide how to represent your vanishing point, e.g., homo- 
geneous coordinates on a sphere, to handle vanishing points near infinity. 

See the discussion of Bingham distributions by Collins and Weiss (1990) for some ideas. 
Figure 5.1 Some popular image segmentation techniques: (a) active contours (Isard and
Blake 1998) © 1998 Springer; (b) level sets (Cremers, Rousson, and Deriche 2007) ©
2007 Springer; (c) graph-based merging (Felzenszwalb and Huttenlocher 2004b) © 2004
Springer; (d) mean shift (Comaniciu and Meer 2002) © 2002 IEEE; (e) texture and intervening
contour-based normalized cuts (Malik, Belongie, Leung et al. 2001) © 2001 Springer;
(f) binary MRF solved using graph cuts (Boykov and Funka-Lea 2006) © 2006 Springer.
Image segmentation is the task of finding groups of pixels that “go together”. In statistics, this
problem is known as cluster analysis and is a widely studied area with hundreds of different 
algorithms (Jain and Dubes 1988; Kaufman and Rousseeuw 1990; Jain, Duin, and Mao 2000; 
Jain, Topchy, Law et al. 2004). 

In computer vision, image segmentation is one of the oldest and most widely studied prob- 
lems (Brice and Fennema 1970; Pavlidis 1977; Riseman and Arbib 1977; Ohlander, Price, 
and Reddy 1978; Rosenfeld and Davis 1979; Haralick and Shapiro 1985). Early techniques 
tend to use region splitting or merging (Brice and Fennema 1970; Horowitz and Pavlidis 1976; 
Ohlander, Price, and Reddy 1978; Pavlidis and Liow 1990), which correspond to divisive and 
agglomerative algorithms in the clustering literature (Jain, Topchy, Law et al. 2004). More 
recent algorithms often optimize some global criterion, such as intra-region consistency and 
inter-region boundary lengths or dissimilarity (Leclerc 1989; Mumford and Shah 1989; Shi 
and Malik 2000; Comaniciu and Meer 2002; Felzenszwalb and Huttenlocher 2004b; Cremers, 
Rousson, and Deriche 2007). 

We have already seen examples of image segmentation in Sections 3.3.2 and 3.7.2. In 
this chapter, we review some additional techniques that have been developed for image seg- 
mentation. These include algorithms based on active contours (Section 5.1) and level sets 
(Section 5.1.4), region splitting and merging (Section 5.2), mean shift (mode finding) (Sec- 
tion 5.3), normalized cuts (splitting based on pixel similarity metrics) (Section 5.4), and bi- 
nary Markov random fields solved using graph cuts (Section 5.5). Figure 5.1 shows some 
examples of these techniques applied to different images. 

Since the literature on image segmentation is so vast, a good way to get a handle on some
of the better performing algorithms is to look at experimental comparisons on human-labeled
databases (Arbelaez, Maire, Fowlkes et al. 2010). The best known of these is the Berkeley
Segmentation Dataset and Benchmark [1] (Martin, Fowlkes, Tal et al. 2001), which consists
of 1000 images from a Corel image dataset that were hand-labeled by 30 human subjects.
Many of the more recent image segmentation algorithms report comparative results on this
database. For example, Unnikrishnan, Pantofaru, and Hebert (2007) propose new metrics
for comparing such algorithms. Estrada and Jepson (2009) compare four well-known segmentation
algorithms on the Berkeley data set and conclude that while their own SE-MinCut
algorithm (Estrada, Jepson, and Chennubhotla 2004) outperforms the others by a
small margin, there still exists a wide gap between automated and human segmentation
performance. [2] A new database of foreground and background segmentations, used by Alpert,
Galun, Basri et al. (2007), is also available. [3]

[1] http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/segbench/

[2] An interesting observation about their ROC plots is that automated techniques cluster tightly along similar
curves, but human performance is all over the map.

[3] http://www.wisdom.weizmann.ac.il/~vision/Seg_Evaluation_DB/index.html
5.1 Active contours

While lines, vanishing points, and rectangles are commonplace in the man-made world, 
curves corresponding to object boundaries are even more common, especially in the natural 
environment. In this section, we describe three related approaches to locating such boundary 
curves in images. 

The first, originally called snakes by its inventors (Kass, Witkin, and Terzopoulos 1988) 
(Section 5.1.1), is an energy-minimizing, two-dimensional spline curve that evolves (moves) 
towards image features such as strong edges. The second, intelligent scissors (Mortensen
and Barrett 1995) (Section 5.1.3), allows the user to sketch in real time a curve that clings to
object boundaries. Finally, level set techniques (Section 5.1.4) evolve the curve as the zero-set
of a characteristic function, which allows them to easily change topology and incorporate
region-based statistics.

All three of these are examples of active contours (Blake and Isard 1998; Mortensen 
1999), since these boundary detectors iteratively move towards their final solution under the 
combination of image and optional user-guidance forces. 

5.1.1 Snakes 

Snakes are a two-dimensional generalization of the 1D energy-minimizing splines first
introduced in Section 3.7.1,

$$E_{\text{int}} = \int \alpha(s)\|f_s(s)\|^2 + \beta(s)\|f_{ss}(s)\|^2 \, ds, \tag{5.1}$$

where s is the arc-length along the curve $f(s) = (x(s), y(s))$ and $\alpha(s)$ and $\beta(s)$ are first-
and second-order continuity weighting functions analogous to the s(x, y) and c(x, y) terms
introduced in (3.100-3.101). We can discretize this energy by sampling the initial curve
position evenly along its length (Figure 4.35) to obtain

$$E_{\text{int}} = \sum_i \alpha(i)\|f(i+1) - f(i)\|^2/h^2 + \beta(i)\|f(i+1) - 2f(i) + f(i-1)\|^2/h^4, \tag{5.2}$$

where h is the step size, which can be neglected if we resample the curve along its arc-length
after each iteration.
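To make the discrete energy (5.2) concrete, the sketch below evaluates it for a closed curve. It assumes constant weights `alpha` and `beta` and a curve stored as a list of (x, y) tuples; the function name `internal_energy` is invented for the example.

```python
def internal_energy(curve, alpha, beta, h=1.0):
    """Discrete snake energy (5.2) for a closed curve given as (x, y) tuples."""
    n = len(curve)
    total = 0.0
    for i in range(n):
        xp, yp = curve[(i - 1) % n]      # f(i - 1), wrapping around the loop
        x0, y0 = curve[i]                # f(i)
        xn, yn = curve[(i + 1) % n]      # f(i + 1)
        # Squared first difference (stretching) and second difference (bending).
        first = (xn - x0) ** 2 + (yn - y0) ** 2
        second = (xn - 2 * x0 + xp) ** 2 + (yn - 2 * y0 + yp) ** 2
        total += alpha * first / h ** 2 + beta * second / h ** 4
    return total
```

Evaluating this on a square and on a half-size copy of it (with h held fixed) gives a lower energy for the smaller curve, which is one way to see why an unconstrained snake tends to shrink.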

In addition to this internal spline energy, a snake simultaneously minimizes external
image-based and constraint-based potentials. The image-based potentials are the sum of
several terms

$$E_{\text{image}} = w_{\text{line}} E_{\text{line}} + w_{\text{edge}} E_{\text{edge}} + w_{\text{term}} E_{\text{term}}, \tag{5.3}$$
Figure 5.2 Snakes (Kass, Witkin, and Terzopoulos 1988) © 1988 Springer: (a) the “snake
pit” for interactively controlling shape; (b) lip tracking.


where the line term attracts the snake to dark ridges, the edge term attracts it to strong
gradients (edges), and the term term attracts it to line terminations. In practice, most systems only
use the edge term, which can either be directly proportional to the image gradients,

$$E_{\text{edge}} = \sum_i -\|\nabla I(f(i))\|^2, \tag{5.4}$$

or to a smoothed version of the image Laplacian,

$$E_{\text{edge}} = \sum_i -|(G_\sigma * \nabla^2 I)(f(i))|^2. \tag{5.5}$$

People also sometimes extract edges and then use a distance map to the edges as an alternative
to these two originally proposed potentials.

In interactive applications, a variety of user-placed constraints can also be added, e.g.,
attractive (spring) forces towards anchor points d(i),

$$E_{\text{spring}} = k_i \|f(i) - d(i)\|^2, \tag{5.6}$$

as well as repulsive 1/r (“volcano”) forces (Figure 5.2a). As the snakes evolve by minimiz- 
ing their energy, they often “wiggle” and “slither”, which accounts for their popular name. 
Figure 5.2b shows snakes being used to track a person’s lips. 

Because regular snakes have a tendency to shrink (Exercise 5.1), it is usually better to 
initialize them by drawing the snake outside the object of interest to be tracked. Alterna- 
tively, an expansion ballooning force can be added to the dynamics (Cohen and Cohen 1993), 
essentially moving each point outwards along its normal. 

To efficiently solve the sparse linear system arising from snake energy minimization, a
sparse direct solver (Appendix A.4) can be used, since the linear system is essentially
penta-diagonal. [4] Snake evolution is usually implemented as an alternation between this linear
system solution and the linearization of non-linear constraints such as edge energy. A more direct
way to find a global energy minimum is to use dynamic programming (Amini, Weymouth,
and Jain 1990; Williams and Shah 1992), but this is not often used in practice, since it has
been superseded by even more efficient or interactive algorithms such as intelligent scissors
(Section 5.1.3) and GrabCut (Section 5.5).

[4] A closed snake has a Toeplitz matrix form, which can still be factored and solved in O(N) time.

Figure 5.3 Elastic net: The open squares indicate the cities and the closed squares linked by
straight line segments are the tour points. The blue circles indicate the approximate extent of
the attraction force of each city, which is reduced over time. Under the Bayesian interpretation
of the elastic net, the blue circles correspond to one standard deviation of the circular Gaussian
that generates each city from some unknown tour point.

Elastic nets and slippery springs 

An interesting variant on snakes, first proposed by Durbin and Willshaw (1987) and later 
re-formulated in an energy-minimizing framework by Durbin, Szeliski, and Yuille (1989), is 
the elastic net formulation of the Traveling Salesman Problem (TSP). Recall that in a TSP, 
the salesman must visit each city once while minimizing the total distance traversed. A snake 
that is constrained to pass through each city could solve this problem (without any optimality 
guarantees) but it is impossible to tell ahead of time which snake control point should be 
associated with each city. 

Instead of having a fixed constraint between snake nodes and cities, as in (5.6), a city is 
assumed to pass near some point along the tour (Figure 5.3). In a probabilistic interpretation, 
each city is generated as a mixture of Gaussians centered at each tour point,

$$p(d(j)) = \sum_i p_{ij} \quad \text{with} \quad p_{ij} = e^{-d_{ij}^2/(2\sigma^2)}, \tag{5.7}$$

where $\sigma$ is the standard deviation of the Gaussian and

$$d_{ij} = \|f(i) - d(j)\| \tag{5.8}$$

is the Euclidean distance between a tour point f(i) and a city location d(j). The corresponding
data fitting energy (negative log likelihood) is

$$E_{\text{slippery}} = -\sum_j \log p(d(j)) = -\sum_j \log \left[ \sum_i e^{-\|f(i) - d(j)\|^2/(2\sigma^2)} \right]. \tag{5.9}$$


This energy derives its name from the fact that, unlike a regular spring, which couples a 
given snake point to a given constraint (5.6), this alternative energy defines a slippery spring 
that allows the association between constraints (cities) and curve (tour) points to evolve over 
time (Szeliski 1989). Note that this is a soft variant of the popular iterated closest point 
data constraint that is often used in fitting or aligning surfaces to data points or to each other 
(Section 12.2.1) (Besl and McKay 1992; Zhang 1994). 

To compute a good solution to the TSP, the slippery spring data association energy is
combined with a regular first-order internal smoothness energy (5.3) to define the cost of a
tour. The tour f(s) is initialized as a small circle around the mean of the city points and $\sigma$ is
progressively lowered (Figure 5.3). For large $\sigma$ values, the tour tries to stay near the centroid
of the points but as $\sigma$ decreases each city pulls more and more strongly on its closest tour
points (Durbin, Szeliski, and Yuille 1989). In the limit as $\sigma \to 0$, each city is guaranteed to
capture at least one tour point and the tours between subsequent cities become straight lines.
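The slippery spring energy (5.9) is simple to evaluate directly, as in the sketch below. The function name `slippery_energy` and the representation of tour points and cities as lists of (x, y) tuples are assumptions. The inner sum is computed with the log-sum-exp trick, which is an implementation choice made here so that small values of sigma do not cause underflow.

```python
import math

def slippery_energy(tour, cities, sigma):
    """Negative log likelihood (5.9) of the cities given the tour points."""
    energy = 0.0
    for cx, cy in cities:
        # Exponents -d_ij^2 / (2 sigma^2) for every tour point, see (5.7)-(5.8).
        exps = [-((fx - cx) ** 2 + (fy - cy) ** 2) / (2.0 * sigma ** 2)
                for fx, fy in tour]
        # log(sum_i exp(e_i)) evaluated stably by factoring out the maximum.
        m = max(exps)
        energy -= m + math.log(sum(math.exp(e - m) for e in exps))
    return energy
```

As sigma shrinks, the largest exponent dominates each inner sum, so the energy approaches the sum of squared distances from each city to its closest tour point divided by 2 sigma squared, which matches the link to iterated closest point noted in the text.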


Splines and shape priors 

While snakes can be very good at capturing the fine and irregular detail in many real-world 
contours, they sometimes exhibit too many degrees of freedom, making it more likely that 
they can get trapped in local minima during their evolution. 

One solution to this problem is to control the snake with fewer degrees of freedom through 
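the use of B-spline approximations (Menet, Saint-Marc, and Medioni 1990a,b; Cipolla and
Blake 1990). The resulting B-snake can be written as

$$f(s) = \sum_k B_k(s) x_k \tag{5.10}$$

or in discrete form as

$$F = BX \tag{5.11}$$

with

$$F = \begin{bmatrix} f^T(0) \\ \vdots \\ f^T(N) \end{bmatrix}, \quad B = \begin{bmatrix} B_0(s_0) & \cdots & B_K(s_0) \\ \vdots & \ddots & \vdots \\ B_0(s_N) & \cdots & B_K(s_N) \end{bmatrix}, \quad \text{and} \quad X = \begin{bmatrix} x^T(0) \\ \vdots \\ x^T(K) \end{bmatrix}. \tag{5.12}$$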
Figure 5.4 Point distribution model for a set of resistors (Cootes, Cooper, Taylor et al.
1995) © 1995 Elsevier: (a) set of input resistor shapes; (b) assignment of control points
to the boundary; (c) distribution (scatter plot) of point locations; (d) first (largest) mode of
variation in the ensemble shapes.


If the object being tracked or recognized has large variations in location, scale, or
orientation, these can be modeled as an additional transformation on the control points, e.g.,
$x'_k = sRx_k + t$ (2.18), which can be estimated at the same time as the values of the control
points. Alternatively, separate detection and alignment stages can be run to first localize and
orient the objects of interest (Cootes, Cooper, Taylor et al. 1995).

In a B-snake, because the snake is controlled by fewer degrees of freedom, there is less 
need for the internal smoothness forces used with the original snakes, although these can still 
be derived and implemented using finite element analysis, i.e., taking derivatives and integrals 
of the B-spline basis functions (Terzopoulos 1983; Bathe 2007). 

In practice, it is more common to estimate a set of shape priors on the typical distribution
of the control points $\{x_k\}$ (Cootes, Cooper, Taylor et al. 1995). Consider the set of resistor
shapes shown in Figure 5.4a. If we describe each contour with the set of control points
shown in Figure 5.4b, we can plot the distribution of each point in a scatter plot, as shown in
Figure 5.4c.

One potential way of describing this distribution would be by the location $\bar{x}_k$ and 2D
covariance $C_k$ of each individual point $x_k$. These could then be turned into a quadratic
penalty (prior energy) on the point location,

$$E_{\text{loc}}(x_k) = \frac{1}{2}(x_k - \bar{x}_k)^T C_k^{-1} (x_k - \bar{x}_k). \tag{5.13}$$

In practice, however, the variation in point locations is usually highly correlated. 

A preferable approach is to estimate the joint covariance of all the points simultaneously.
First, concatenate all of the point locations $\{x_k\}$ into a single vector x, e.g., by interleaving
the x and y locations of each point. The distribution of these vectors across all training
examples (Figure 5.4a) can be described with a mean $\bar{x}$ and a covariance

$$C = \frac{1}{P} \sum_p (x_p - \bar{x})(x_p - \bar{x})^T, \tag{5.14}$$

where $x_p$ are the P training examples. Using eigenvalue analysis (Appendix A.1.2), which is
also known as Principal Component Analysis (PCA) (Appendix B.1.1), the covariance matrix
can be written as

$$C = \Phi \, \mathrm{diag}(\lambda_0 \ldots \lambda_{K-1}) \, \Phi^T. \tag{5.15}$$

Figure 5.5 Active Shape Model (ASM): (a) the effect of varying the first four shape
parameters for a set of faces (Cootes, Taylor, Lanitis et al. 1993) © 1993 IEEE; (b) searching for
the strongest gradient along the normal to each control point (Cootes, Cooper, Taylor et al.
1995) © 1995 Elsevier.


In most cases, the likely appearance of the points can be modeled using only a few
eigenvectors with the largest eigenvalues. The resulting point distribution model (Cootes, Taylor,
Lanitis et al. 1993; Cootes, Cooper, Taylor et al. 1995) can be written as

$$x = \bar{x} + \hat{\Phi} b, \tag{5.16}$$

where b is an $M \ll K$ element shape parameter vector and $\hat{\Phi}$ are the first M columns of $\Phi$.
To constrain the shape parameters to reasonable values, we can use a quadratic penalty of the
form

$$E_{\text{shape}} = \frac{1}{2} b^T \mathrm{diag}(\lambda_0 \ldots \lambda_{M-1})^{-1} b = \sum_m b_m^2 / 2\lambda_m. \tag{5.17}$$

Alternatively, the range of allowable $b_m$ values can be limited to some range, e.g.,
$|b_m| \le 3\sqrt{\lambda_m}$ (Cootes, Cooper, Taylor et al. 1995). Alternative approaches for deriving a set of
shape vectors are reviewed by Isard and Blake (1998).
Varying the individual shape parameters $b_m$ over the range $-2\sqrt{\lambda_m} \le b_m \le 2\sqrt{\lambda_m}$ can give
a good indication of the expected variation in appearance, as shown in Figure 5.4d. Another
example, this time related to face contours, is shown in Figure 5.5a.
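The two operations described above, limiting each shape parameter to a multiple of its standard deviation and synthesizing a shape from the parameters as in (5.16), can be sketched as follows. The function names and the representation of the eigenvectors as a list of lists (one list per mode) are assumptions made for the example; a real implementation would more likely use a matrix library.

```python
import math

def clamp_shape_params(b, eigenvalues, limit=3.0):
    """Clamp each b_m to [-limit * sqrt(lambda_m), +limit * sqrt(lambda_m)]."""
    out = []
    for b_m, lam in zip(b, eigenvalues):
        bound = limit * math.sqrt(lam)
        out.append(max(-bound, min(bound, b_m)))
    return out

def synthesize_shape(mean, modes, b):
    """Evaluate x = mean + sum_m b_m * modes[m], as in (5.16)."""
    return [mu + sum(b_m * mode[k] for b_m, mode in zip(b, modes))
            for k, mu in enumerate(mean)]
```

Sweeping one entry of `b` between minus and plus two standard deviations while holding the others at zero, and calling `synthesize_shape` each time, reproduces the kind of mode visualization described in the text.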

In order to align a point distribution model with an image, each control point searches 
in a direction normal to the contour to find the most likely corresponding image edge point 
(Figure 5.5b). These individual measurements can be combined with priors on the shape 
parameters (and, if desired, position, scale, and orientation parameters) to estimate a new set 
of parameters. The resulting Active Shape Model (ASM) can be iteratively minimized to fit 
images to non-rigidly deforming objects such as medical images or body parts such as hands 
(Cootes, Cooper, Taylor et al. 1995). The ASM can also be combined with a PCA analysis of 
the underlying gray-level distribution to create an Active Appearance Model (AAM) (Cootes, 
Edwards, and Taylor 2001), which we discuss in more detail in Section 14.2.2. 

5.1.2 Dynamic snakes and CONDENSATION 

In many applications of active contours, the object of interest is being tracked from frame 
to frame as it deforms and evolves. In this case, it makes sense to use estimates from the 
previous frame to predict and constrain the new estimates. 

One way to do this is to use Kalman filtering, which results in a formulation called Kalman 
snakes (Terzopoulos and Szeliski 1992; Blake, Curwen, and Zisserman 1993). The Kalman 
filter is based on a linear dynamic model of shape parameter evolution, 


$$x_t = A x_{t-1} + w_t, \tag{5.18}$$

where $x_t$ and $x_{t-1}$ are the current and previous state variables, A is the linear transition
matrix, and w is a noise (perturbation) vector, which is often modeled as a Gaussian (Gelb
1974). The matrix A and the noise covariance can be learned ahead of time by observing
typical sequences of the object being tracked (Blake and Isard 1998).

The qualitative behavior of the Kalman filter can be seen in Figure 5.6a. The linear dy- 
namic model causes a deterministic change (drift) in the previous estimate, while the process 
noise (perturbation) causes a stochastic diffusion that increases the system entropy (lack of 
certainty). New measurements from the current frame restore some of the certainty (peaked- 
ness) in the updated estimate. 
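The drift, diffusion, and measurement stages can be seen most easily in a scalar Kalman filter. The sketch below is a simplified illustration, not the Kalman snake formulation itself: it assumes a one-dimensional state that is observed directly (a measurement matrix of 1), and the function and parameter names are invented for the example.

```python
def kalman_step(mean, var, a, q, z, r):
    """One predict/update cycle of a scalar Kalman filter.

    mean, var: previous estimate and its variance
    a, q:      transition coefficient and process-noise variance, as in (5.18)
    z, r:      new measurement and its noise variance
    """
    # Predict: deterministic drift, then diffusion inflates the variance.
    pred_mean = a * mean
    pred_var = a * a * var + q
    # Update: blend prediction and measurement by their relative certainty.
    gain = pred_var / (pred_var + r)
    new_mean = pred_mean + gain * (z - pred_mean)
    new_var = (1.0 - gain) * pred_var
    return new_mean, new_var
```

For example, `kalman_step(0.0, 1.0, 1.0, 1.0, 2.0, 2.0)` first widens the variance from 1 to 2 during prediction and then narrows it back to 1 once the measurement is incorporated, moving the mean halfway towards the measurement.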

In many situations, however, such as when tracking in clutter, a better estimate for the
contour can be obtained if we remove the assumption that the distributions are Gaussian,
which is what the Kalman filter requires. In this case, a general multi-modal distribution is
propagated, as shown in Figure 5.6b. In order to model such multi-modal distributions, Isard
and Blake (1998) introduced the use of particle filtering to the computer vision community. [5]

[5] Alternatives to modeling multi-modal distributions include mixtures of Gaussians (Bishop 2006) and multiple
hypothesis tracking (Bar-Shalom and Fortmann 1988; Cham and Rehg 1999).

Figure 5.6 Probability density propagation (Isard and Blake 1998) © 1998 Springer. At
the beginning of each estimation step, the probability density is updated according to the
linear dynamic model (deterministic drift) and its certainty is reduced due to process noise
(stochastic diffusion). New measurements introduce additional information that helps refine
the current estimate. (a) The Kalman filter models the distributions as uni-modal, i.e., using a
mean and covariance. (b) Some applications require more general multi-modal distributions.

Figure 5.7 Factored sampling using particle filter in the CONDENSATION algorithm (Isard
and Blake 1998) © 1998 Springer: (a) each density distribution is represented using a
superposition of weighted particles; (b) the drift-diffusion-measurement cycle implemented
using random sampling, perturbation, and re-weighting stages.

Figure 5.8 Head tracking using CONDENSATION (Isard and Blake 1998) © 1998
Springer: (a) sample set representation of head estimate distribution; (b) multiple
measurements at each control vertex location; (c) multi-hypothesis tracking over time.

Particle filtering techniques represent a probability distribution using a collection of weighted
point samples (Figure 5.7a) (Andrieu, de Freitas, Doucet et al. 2003; Bishop 2006; Koller
and Friedman 2009). To update the locations of the samples according to the linear
dynamics (deterministic drift), the centers of the samples are updated according to (5.18) and
multiple samples are generated for each point (Figure 5.7b). These are then perturbed to
account for the stochastic diffusion, i.e., their locations are moved by random vectors taken
from the distribution of w. [6] Finally, the weights of these samples are multiplied by the
measurement probability density, i.e., we take each sample and measure its likelihood given the
current (new) measurements. Because the point samples represent and propagate conditional
estimates of the multi-modal density, Isard and Blake (1998) dubbed their algorithm
CONditional DENSity propagATION or CONDENSATION.

Figure 5.8a shows what a factored sample of a head tracker might look like, drawing
a red B-spline contour for each of (a subset of) the particles being tracked. Figure 5.8b
shows why the measurement density itself is often multi-modal: the locations of the edges
perpendicular to the spline curve can have multiple local maxima due to background clutter.
Finally, Figure 5.8c shows the temporal evolution of the conditional density (x coordinate of
the head and shoulder tracker centroid) as it tracks several people over time.

[6] Note that because of the structure of these steps, non-linear dynamics and non-Gaussian noise can be used.
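The sampling, drift, diffusion, and re-weighting cycle of a particle filter can be sketched for a scalar state as shown below. This is a toy illustration rather than the CONDENSATION contour tracker: the state is a single number instead of a set of B-spline control points, the measurement likelihood is assumed to be Gaussian, and all function and parameter names are hypothetical.

```python
import math
import random

def condensation_step(particles, weights, a, noise_std, z, meas_std, rng=random):
    """One resample / drift / diffuse / re-weight cycle for scalar states."""
    n = len(particles)
    # Factored sampling: draw particles in proportion to their weights.
    resampled = rng.choices(particles, weights=weights, k=n)
    # Deterministic drift (5.18) followed by stochastic diffusion.
    moved = [a * x + rng.gauss(0.0, noise_std) for x in resampled]
    # Re-weight by the measurement likelihood (a Gaussian is assumed here).
    new_w = [math.exp(-0.5 * ((z - x) / meas_std) ** 2) for x in moved]
    total = sum(new_w)
    if total == 0.0:
        new_w = [1.0 / n] * n     # all likelihoods underflowed: fall back to uniform
    else:
        new_w = [w / total for w in new_w]
    return moved, new_w
```

Because the likelihood is only evaluated at sample locations, replacing the Gaussian with a multi-modal measurement density (as in Figure 5.8b) requires changing a single line, which is the main practical appeal of the approach.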
Figure 5.9 Intelligent scissors: (a) as the mouse traces the white path, the scissors follow
the orange path along the object boundary (the green curves show intermediate positions) 
(Mortensen and Barrett 1995) © 1995 ACM; (b) regular scissors can sometimes jump to a 
stronger (incorrect) boundary; (c) after training to the previous segment, similar edge profiles 
are preferred (Mortensen and Barrett 1998) © 1995 Elsevier. 


5.1.3 Scissors 

Active contours allow a user to roughly specify a boundary of interest and have the system 
evolve the contour towards a more accurate location as well as track it over time. The results 
of this curve evolution, however, may be unpredictable and may require additional user-based 
hints to achieve the desired result. 

An alternative approach is to have the system optimize the contour in real time as the 
user is drawing (Mortensen 1999). The intelligent scissors system developed by Mortensen 
and Barrett (1995) does just that. As the user draws a rough outline (the white curve in 
Figure 5.9a), the system computes and draws a better curve that clings to high-contrast edges 
(the orange curve). 

To compute the optimal curve path (live-wire), the image is first pre-processed to associate
low costs with edges (links between neighboring horizontal, vertical, and diagonal, i.e., $N_8$
neighbors) that are likely to be boundary elements. Their system uses a combination of
zero-crossing, gradient magnitudes, and gradient orientations to compute these costs.

Next, as the user traces a rough curve, the system continuously recomputes the lowest-cost
path between the starting seed point and the current mouse location using Dijkstra’s
algorithm, a best-first dynamic programming algorithm that terminates at the current target
location.
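A minimal sketch of the shortest-path computation is given below. It simplifies the system described above in two ways that should be read as assumptions: costs are attached to pixels rather than to links between pixels, and diagonal steps are not given a longer length. The function name `live_wire` and the grid representation are invented for the example.

```python
import heapq

def live_wire(cost, seed, target):
    """Lowest-cost 8-connected path from seed to target on a cost grid.

    cost[r][c] is the (non-negative) cost of stepping onto pixel (r, c);
    low values are assumed to mark likely boundary pixels.
    """
    rows, cols = len(cost), len(cost[0])
    best = {seed: 0.0}
    parent = {}
    heap = [(0.0, seed)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break                         # target reached with its final cost
        if d > best[node]:
            continue                      # stale queue entry
        r, c = node
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols:
                    nd = d + cost[nr][nc]
                    if nd < best.get((nr, nc), float("inf")):
                        best[(nr, nc)] = nd
                        parent[(nr, nc)] = node
                        heapq.heappush(heap, (nd, (nr, nc)))
    # Walk the parent pointers back from the target to the seed.
    path = [target]
    while path[-1] != seed:
        path.append(parent[path[-1]])
    return path[::-1]
```

In an interactive tool, the search would typically be run once from the seed over the whole image so that the path to any mouse position can be read off from the parent pointers without recomputation.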

In order to keep the system from jumping around unpredictably, the system will “freeze”
the curve to date (reset the seed point) after a period of inactivity. To prevent the live wire
from jumping onto adjacent higher-contrast contours, the system also “learns” the intensity
profile under the current optimized curve, and uses this to preferentially keep the wire moving
along the same (or a similar looking) boundary (Figure 5.9b-c).

Figure 5.10 Level set evolution for a geodesic active contour. The embedding function $\phi$
is updated based on the curvature of the underlying surface modulated by the edge/speed
function g(I), as well as the gradient of g(I), thereby attracting it to strong edges.

Several extensions have been proposed to the basic algorithm, which works remarkably 
well even in its original form. Mortensen and Barrett (1999) use tobogganing, which is a
simple form of watershed region segmentation, to pre-segment the image into regions whose 
boundaries become candidates for optimized curve paths. The resulting region boundaries 
are turned into a much smaller graph, where nodes are located wherever three or four regions 
meet. The Dijkstra algorithm is then run on this reduced graph, resulting in much faster (and 
often more stable) performance. Another extension to intelligent scissors is to use a proba- 
bilistic framework that takes into account the current trajectory of the boundary, resulting in 
a system called JetStream (Perez, Blake, and Gangnet 2001). 

Instead of re-computing an optimal curve at each time instant, a simpler system can be 
developed by simply “snapping” the current mouse position to the nearest likely boundary 
point (Gleicher 1995). Applications of these boundary extraction techniques to image cutting 
and pasting are presented in Section 10.4. 

5.1.4 Level Sets 

A limitation of active contours based on parametric curves of the form f(s), e.g., snakes, B-
snakes, and CONDENSATION, is that it is challenging to change the topology of the curve
as it evolves. (McInerney and Terzopoulos (1999, 2000) describe one approach to doing
this.) Furthermore, if the shape changes dramatically, curve reparameterization may also be 
required. 

An alternative representation for such closed contours is to use a level set, where the zero- 
crossing(s) of a characteristic (or signed distance (Section 3.3.3)) function define the curve. 
Level sets evolve to fit and track objects of interest by modifying the underlying embedding
function (another name for this 2D function) $\phi(x, y)$ instead of the curve f(s) (Malladi,
Sethian, and Vemuri 1995; Sethian 1999; Sapiro 2001; Osher and Paragios 2003). To reduce
the amount of computation required, only a small strip (frontier) around the locations of the
current zero-crossing needs to be updated at each step, which results in what are called fast
marching methods (Sethian 1999). 

An example of an evolution equation is the geodesic active contour proposed by Caselles, 
Kimmel, and Sapiro (1997) and Yezzi, Kichenassamy, Kumar et al. (1997), 

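$$ \frac{d\phi}{dt} = |\nabla\phi| \, \mathrm{div}\left( g(I) \frac{\nabla\phi}{|\nabla\phi|} \right) = g(I) \, |\nabla\phi| \, \mathrm{div}\left( \frac{\nabla\phi}{|\nabla\phi|} \right) + \nabla g(I) \cdot \nabla\phi, \tag{5.19} $$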

where g(I) is a generalized version of the snake edge potential (5.5). To get an intuitive sense 
of the curve’s behavior, assume that the embedding function $\phi$ is a signed distance function
away from the curve (Figure 5.10), in which case $|\nabla\phi| = 1$. The first term in Equation (5.19)
moves the curve in the direction of its curvature, i.e., it acts to straighten the curve, under 
the influence of the modulation function g(I). The second term moves the curve down the 
gradient of g(I), encouraging the curve to migrate towards minima of g(I). 
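
To make the two terms of Equation (5.19) concrete, here is a minimal sketch of one explicit time step on a pixel grid. It is only an illustration under stated assumptions: central differences via `numpy.gradient`, a small fixed time step `dt`, no narrow-band (frontier) update, and no re-initialization of the signed distance function. The function name `geodesic_step` is hypothetical.

```python
import numpy as np

def geodesic_step(phi, g, dt=0.1, eps=1e-8):
    """One explicit Euler step of the geodesic active contour evolution (5.19)."""
    phi_y, phi_x = np.gradient(phi)            # derivatives along rows (y) and columns (x)
    mag = np.sqrt(phi_x**2 + phi_y**2)         # |grad phi|
    nx, ny = phi_x / (mag + eps), phi_y / (mag + eps)
    # Curvature term: divergence of the normalized gradient field.
    curvature = np.gradient(nx, axis=1) + np.gradient(ny, axis=0)
    g_y, g_x = np.gradient(g)
    dphi = g * mag * curvature + (g_x * phi_x + g_y * phi_y)
    return phi + dt * dphi

# Toy example: a circle of radius 20 (phi < 0 inside) with a constant speed
# function g = 1, so only the curvature term is active and the circle shrinks.
yy, xx = np.mgrid[0:64, 0:64]
phi0 = np.sqrt((xx - 32.0) ** 2 + (yy - 32.0) ** 2) - 20.0
g = np.ones_like(phi0)
phi = phi0.copy()
for _ in range(100):
    phi = geodesic_step(phi, g)
```

With a real image, g(I) would be small near strong edges, which slows the curvature motion there, while the second term pulls the zero level set towards minima of g(I).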

While this level-set formulation can readily change topology, it is still susceptible to lo- 
cal minima, since it is based on local measurements such as image gradients. An alternative 
approach is to re-cast the problem in a segmentation framework, where the energy measures 
the consistency of the image statistics (e.g., color, texture, motion) inside and outside the seg- 
mented regions (Cremers, Rousson, and Deriche 2007; Rousson and Paragios 2008; Houhou, 
Thiran, and Bresson 2008). These approaches build on earlier energy-based segmentation 
frameworks introduced by Leclerc (1989), Mumford and Shah (1989), and Chan and Vese 
(1992), which are discussed in more detail in Section 5.5. Examples of such level-set seg- 
mentations are shown in Figure 5.11, which shows the evolution of the level sets from a series 
of distributed circles towards the final binary segmentation. 

For more information on level sets and their applications, please see the collection of 
papers edited by Osher and Paragios (2003) as well as the series of Workshops on Variational 
and Level Set Methods in Computer Vision (Paragios, Faugeras, Chan et al. 2005) and Special 
Issues on Scale Space and Variational Methods in Computer Vision (Paragios and Sgallari 
2009). 

5.1.5 Application: Contour tracking and rotoscoping

Active contours can be used in a wide variety of object-tracking applications (Blake and Isard 
1998; Yilmaz, Javed, and Shah 2006). For example, they can be used to track facial features 


Figure 5.11 Level set segmentation (Cremers, Rousson, and Deriche 2007) © 2007 
Springer: (a) grayscale image segmentation and (b) color image segmentation. Uni-variate 
and multi-variate Gaussians are used to model the foreground and background pixel dis- 
tributions. The initial circles evolve towards an accurate segmentation of foreground and 
background, adapting their topology as they evolve. 


for performance-driven animation (Terzopoulos and Waters 1990; Lee, Terzopoulos, and Wa- 
ters 1995; Parke and Waters 1996; Bregler, Covell, and Slaney 1997) (Figure 5.2b). They can 
also be used to track heads and people, as shown in Figure 5.8, as well as moving vehicles 
(Paragios and Deriche 2000). Additional applications include medical image segmentation, 
where contours can be tracked from slice to slice in computerized tomography (3D medical 
imagery) (Cootes and Taylor 2001) or over time, as in ultrasound scans. 

An interesting application that is closer to computer animation and visual effects is ro- 
toscoping, which uses the tracked contours to deform a set of hand-drawn animations (or 
to modify or replace the original video frames). 7 Agarwala, Hertzmann, Seitz et al. (2004) 
present a system based on tracking hand-drawn B-spline contours drawn at selected keyframes, 
using a combination of geometric and appearance-based criteria (Figure 5.12). They also
provide an excellent review of previous rotoscoping and image-based, contour-tracking systems.

7 The term comes from a device (a rotoscope) that projected frames of a live-action film underneath an acetate so 
that artists could draw animations directly over the actors’ shapes.

Figure 5.12 Keyframe-based rotoscoping (Agarwala, Hertzmann, Seitz et al. 2004) © 2004
ACM: (a) original frames; (b) rotoscoped contours; (c) re-colored blouse; (d) rotoscoped 
hand-drawn animation. 


Additional applications of rotoscoping (object contour detection and segmentation), such 
as cutting and pasting objects from one photograph into another, are presented in Section 10.4. 


5.2 Split and merge 

As mentioned in the introduction to this chapter, the simplest possible technique for seg- 
menting a grayscale image is to select a threshold and then compute connected components 
(Section 3.3.2). Unfortunately, a single threshold is rarely sufficient for the whole image 
because of lighting and intra-object statistical variations. 

In this section, we describe a number of algorithms that proceed either by recursively 
splitting the whole image into pieces based on region statistics or, conversely, merging pixels 
and regions together in a hierarchical fashion. It is also possible to combine both splitting and 
merging by starting with a medium-grain segmentation (in a quadtree representation) and 
then allowing both merging and splitting operations (Horowitz and Pavlidis 1976; Pavlidis 
and Liow 1990). 

5.2.1 Watershed 

A technique related to thresholding, since it operates on a grayscale image, is watershed com- 
putation (Vincent and Soille 1991). This technique segments an image into several catchment 
basins, which are the regions of an image (interpreted as a height field or landscape) where

Figure 5.13 Locally constrained watershed segmentation (Beare 2006) © 2006 IEEE: (a)
original confocal microscopy image with marked seeds (line segments); (b) standard water- 
shed segmentation; (c) locally constrained watershed segmentation. 


rain would flow into the same lake. An efficient way to compute such regions is to start flood- 
ing the landscape at all of the local minima and to label ridges wherever differently evolving 
components meet. The whole algorithm can be implemented using a priority queue of pixels 
and breadth-first search (Vincent and Soille 1991). 8 

Since images rarely have dark regions separated by lighter ridges, watershed segmen- 
tation is usually applied to a smoothed version of the gradient magnitude image, which also 
makes it usable with color images. As an alternative, the maximum oriented energy in a steer- 
able filter (3.28-3.29) (Freeman and Adelson 1991) can be used as the basis of the oriented 
watershed transform developed by Arbelaez, Maire, Fowlkes et al. (2010). Such techniques 
end up finding smooth regions separated by visible (higher gradient) boundaries. Since such 
boundaries are what active contours usually follow, active contour algorithms (Mortensen and 
Barrett 1999; Li, Sun, Tang et al. 2004) often precompute such a segmentation using either 
the watershed or the related tobogganing technique (Section 5.1.3). 

Unfortunately, watershed segmentation associates a unique region with each local mini- 
mum, which can lead to over-segmentation. Watershed segmentation is therefore often used 
as part of an interactive system, where the user first marks seed locations (with a click or 
a short stroke) that correspond to the centers of different desired components. Figure 5.13 
shows the results of running the watershed algorithm with some manually placed markers on 
a confocal microscopy image. It also shows the result for an improved version of watershed 
that uses local morphology to smooth out and optimize the boundaries separating the regions 
(Beare 2006). 
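
A seeded variant of the priority-queue flooding described above can be sketched as follows. This is a simplified illustration, not the Vincent and Soille algorithm itself: it assumes user-supplied integer markers, 4-connectivity, and no explicit ridge labels (a pixel simply joins whichever basin reaches it first). The function name `seeded_watershed` is hypothetical.

```python
import heapq
import numpy as np

def seeded_watershed(height, markers):
    """Flood a height field from labeled seeds (markers > 0), lowest pixels first."""
    labels = markers.copy()
    rows, cols = height.shape
    heap = []
    counter = 0  # tie-breaker so that equal heights are processed in insertion order
    for r, c in zip(*np.nonzero(markers)):
        heapq.heappush(heap, (height[r, c], counter, r, c))
        counter += 1
    while heap:
        _, _, r, c = heapq.heappop(heap)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and labels[nr, nc] == 0:
                labels[nr, nc] = labels[r, c]  # first basin to arrive claims the pixel
                heapq.heappush(heap, (height[nr, nc], counter, nr, nc))
                counter += 1
    return labels

# Toy landscape: two valleys separated by a ridge of height 5.
height = np.array([[0, 1, 2, 5, 2, 1, 0]])
markers = np.zeros_like(height)
markers[0, 0], markers[0, 6] = 1, 2
labels = seeded_watershed(height, markers)
```

In practice, `height` would be a smoothed gradient magnitude image, and the markers would come from user clicks or strokes.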


8 A related algorithm can be used to compute maximally stable extremal regions (MSERs) efficiently
(Section 4.1.1) (Nister and Stewenius 2008).

5.2.2 Region splitting (divisive clustering)

Splitting the image into successively finer regions is one of the oldest techniques in computer 
vision. Ohlander, Price, and Reddy (1978) present such a technique, which first computes a 
histogram for the whole image and then finds a threshold that best separates the large peaks 
in the histogram. This process is repeated until regions are either fairly uniform or below a 
certain size. 

More recent splitting algorithms often optimize some metric of intra-region similarity and 
inter-region dissimilarity. These are covered in Sections 5.4 and 5.5. 

5.2.3 Region merging (agglomerative clustering) 

Region merging techniques also date back to the beginnings of computer vision. Brice and 
Fennema (1970) use a dual grid for representing boundaries between pixels and merge re- 
gions based on their relative boundary lengths and the strength of the visible edges at these 
boundaries. 

In data clustering, algorithms can link clusters together based on the distance between 
their closest points (single-link clustering), their farthest points (complete-link clustering), or 
something in between (Jain, Topchy, Law et al. 2004). Kamvar, Klein, and Manning (2002) 
provide a probabilistic interpretation of these algorithms and show how additional models 
can be incorporated within this framework. 

A very simple version of pixel-based merging combines adjacent regions whose average 
color difference is below a threshold or whose regions are too small. Segmenting the image 
into such superpixels (Mori, Ren, Efros et al. 2004), which are not semantically meaningful,
can be a useful pre-processing stage to make higher-level algorithms such as stereo matching 
(Zitnick, Kang, Uyttendaele et al. 2004; Taguchi, Wilburn, and Zitnick 2008), optic flow 
(Zitnick, Jojic, and Kang 2005; Brox, Bregler, and Malik 2009), and recognition (Mori, Ren, 
Efros et al. 2004; Mori 2005; Gu, Lim, Arbelaez et al. 2009; Lim, Arbelaez, Gu et al. 2009) 
both faster and more robust. 


5.2.4 Graph-based segmentation 

While many merging algorithms simply apply a fixed rule that groups pixels and regions 
together, Felzenszwalb and Huttenlocher (2004b) present a merging algorithm that uses rel- 
ative dissimilarities between regions to determine which ones should be merged; it produces 
an algorithm that provably optimizes a global grouping metric. They start with a pixel-to- 
pixel dissimilarity measure w(e) that measures, for example, intensity differences between 
$N_8$ neighbors. (Alternatively, they can use the joint feature space distances (5.42) introduced
by Comaniciu and Meer (2002), which we discuss in Section 5.3.2.) 

Figure 5.14 Graph-based merging segmentation (Felzenszwalb and Huttenlocher 2004b) 
© 2004 Springer: (a) input grayscale image that is successfully segmented into three regions 
even though the variation inside the smaller rectangle is larger than the variation across the 
middle edge; (b) input grayscale image; (c) resulting segmentation using an $N_8$ pixel
neighborhood.

For any region R, its internal difference is defined as the largest edge weight in the re- 
gion’s minimum spanning tree, 


$$ \mathrm{Int}(R) = \max_{e \in MST(R)} w(e). \tag{5.20} $$

For any two adjacent regions with at least one edge connecting their vertices, the difference 
between these regions is defined as the minimum weight edge connecting the two regions, 

$$ \mathrm{Dif}(R_1, R_2) = \min_{e = (v_1, v_2) \,|\, v_1 \in R_1, v_2 \in R_2} w(e). \tag{5.21} $$

Their algorithm merges any two adjacent regions whose difference is smaller than the
minimum internal difference of these two regions,

$$ \mathrm{MInt}(R_1, R_2) = \min(\mathrm{Int}(R_1) + \tau(R_1), \mathrm{Int}(R_2) + \tau(R_2)), \tag{5.22} $$

where $\tau(R)$ is a heuristic region penalty that Felzenszwalb and Huttenlocher (2004b) set to
$k/|R|$, but which can be set to any application-specific measure of region goodness.

By merging regions in increasing order of the weights of the edges separating them (which
can be efficiently evaluated using a variant of Kruskal’s minimum spanning tree algorithm),
they provably produce segmentations that are neither too fine (there exist regions that could
have been merged) nor too coarse (there are regions that could be split without being
mergeable). For fixed-size pixel neighborhoods, the running time for this algorithm is
O(N log N), where N
is the number of image pixels, which makes it one of the fastest segmentation algorithms 
(Paris and Durand 2007). Figure 5.14 shows two examples of images segmented using their 
technique. 
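
The merging rule in Equations (5.20)-(5.22) can be sketched with a union-find structure. This is a minimal illustration on an abstract graph, assuming the penalty $\tau(R) = k/|R|$ and edges processed in increasing weight order; the function name `fh_segment` and the toy data are hypothetical, and building the pixel graph from an image is left out.

```python
def fh_segment(num_nodes, edges, k):
    """Graph-based merging; edges is a list of (weight, u, v) tuples.

    Returns, for each node, the representative (root) of its region.
    """
    parent = list(range(num_nodes))
    size = [1] * num_nodes
    internal = [0.0] * num_nodes  # Int(R): largest MST edge weight inside each region

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for w, u, v in sorted(edges):
        a, b = find(u), find(v)
        if a == b:
            continue
        # MInt from (5.22) with tau(R) = k / |R|.
        m_int = min(internal[a] + k / size[a], internal[b] + k / size[b])
        if w <= m_int:
            parent[b] = a
            size[a] += size[b]
            internal[a] = w  # edges arrive in increasing order, so w is the new maximum
    return [find(u) for u in range(num_nodes)]

# Toy 1D "image": two flat runs separated by a strong step edge.
intensity = [10, 11, 10, 50, 51, 50]
edges = [(abs(intensity[i] - intensity[i + 1]), i, i + 1) for i in range(5)]
regions = fh_segment(len(intensity), edges, k=5)
```

Larger values of `k` favor larger regions, since small regions then need stronger evidence of a boundary before they are kept separate.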

Figure 5.15 Coarse to fine node aggregation in segmentation by weighted aggregation
(SWA) (Sharon, Galun, Sharon et al. 2006) © 2006 Macmillan Publishers Ltd [Nature]: (a) 
original gray-level pixel grid; (b) inter-pixel couplings, where thicker lines indicate stronger 
couplings; (c) after one level of coarsening, where each original pixel is strongly coupled to 
one of the coarse-level nodes; (d) after two levels of coarsening. 


5.2.5 Probabilistic aggregation 

Alpert, Galun, Basri et al. (2007) develop a probabilistic merging algorithm based on two 
cues, namely gray-level similarity and texture similarity. The gray-level similarity between 
regions $R_i$ and $R_j$ is based on the minimal external difference from other neighboring regions,

$$ \sigma^+_{\text{local}} = \min(\Delta_i^+, \Delta_j^+), \tag{5.23} $$

where $\Delta_i^+ = \min_k |\Delta_{ik}|$ and $\Delta_{ik}$ is the difference in average intensities between regions $R_i$
and $R_k$. This is compared to the average intensity difference,

$$ \sigma^-_{\text{local}} = \frac{\Delta_i^- + \Delta_j^-}{2}, \tag{5.24} $$

where $\Delta_i^- = \sum_k (\tau_{ik} \Delta_{ik}) / \sum_k (\tau_{ik})$ and $\tau_{ik}$ is the boundary length between regions $R_i$ and
$R_k$. The texture similarity is defined using relative differences between histogram bins of
simple oriented Sobel filter responses. The pairwise statistics $\sigma^+_{\text{local}}$ and $\sigma^-_{\text{local}}$ are used to
compute the likelihoods $p_{ij}$ that two regions should be merged. (See the paper by Alpert,
Galun, Basri et al. (2007) for more details.) 

Merging proceeds in a hierarchical fashion inspired by algebraic multigrid techniques 
(Brandt 1986; Briggs, Henson, and McCormick 2000) and previously used by Alpert, Galun, 
Basri et al. (2007) in their segmentation by weighted aggregation (SWA) algorithm (Sharon, 
Galun, Sharon et al. 2006), which we discuss in Section 5.4. A subset of the nodes $C \subset V$
that are (collectively) strongly coupled to all of the original nodes (regions) are used to define 
the problem at a coarser scale (Figure 5.15), where strong coupling is defined as 


$$ \frac{\sum_{j \in C} p_{ij}}{\sum_{j \in V} p_{ij}} > \phi, \tag{5.25} $$

with $\phi$ usually set to 0.2. The intensity and texture similarity statistics for the coarser nodes
are recursively computed using weighted averaging, where the relative strengths (couplings)
between coarse- and fine-level nodes are based on their merge probabilities $p_{ij}$. This allows
the algorithm to run in essentially O(N) time, using the same kind of hierarchical aggrega- 
tion operations that are used in pyramid-based filtering or preconditioning algorithms. After 
a segmentation has been identified at a coarser level, the exact memberships of each pixel are 
computed by propagating coarse-level assignments to their finer-level “children” (Sharon, 
Galun, Sharon et al. 2006; Alpert, Galun, Basri et al. 2007). Figure 5.22 shows the
segmentations produced by this algorithm compared to other popular segmentation algorithms.


5.3 Mean shift and mode finding 

Mean-shift and mode finding techniques, such as k-means and mixtures of Gaussians, model 
the feature vectors associated with each pixel (e.g., color and position) as samples from an 
unknown probability density function and then try to find clusters (modes) in this distribution. 

Consider the color image shown in Figure 5.16a. How would you segment this image 
based on color alone? Figure 5.16b shows the distribution of pixels in L*u*v* space, which 
is equivalent to what a vision algorithm that ignores spatial location would see. To make the 
visualization simpler, let us only consider the L*u* coordinates, as shown in Figure 5.16c. 
How many obvious (elongated) clusters do you see? How would you go about finding these 
clusters? 

The k-means and mixtures of Gaussians techniques use a parametric model of the den- 
sity function to answer this question, i.e., they assume the density is the superposition of a 
small number of simpler distributions (e.g., Gaussians) whose locations (centers) and shape 
(covariance) can be estimated. Mean shift, on the other hand, smoothes the distribution and 
finds its peaks as well as the regions of feature space that correspond to each peak. Since 
a complete density is being modeled, this approach is called non-parametric (Bishop 2006). 
Let us look at these techniques in more detail. 


5.3.1 K-means and mixtures of Gaussians 

While k-means implicitly models the probability density as a superposition of spherically 
symmetric distributions, it does not require any probabilistic reasoning or modeling (Bishop 
2006). Instead, the algorithm is given the number of clusters k it is supposed to find; it 
then iteratively updates the cluster center location based on the samples that are closest to 
each center. The algorithm can be initialized by randomly sampling k centers from the input 
feature vectors. Techniques have also been developed for splitting or merging cluster centers 

Figure 5.16 Mean-shift image segmentation (Comaniciu and Meer 2002) © 2002 IEEE:
(a) input color image; (b) pixels plotted in L*u*v* space; (c) L*u* space distribution; (d)
clustered results after 159 mean-shift procedures; (e) corresponding trajectories with peaks
marked as red dots.

based on their statistics, and for accelerating the process of finding the nearest mean center
(Bishop 2006). 
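
The basic k-means iteration described above can be sketched as follows. This is a minimal version under simple assumptions: random initialization from the input samples, a fixed number of iterations, and no splitting, merging, or nearest-neighbor acceleration. The function name `kmeans` and the toy data are illustrative only.

```python
import numpy as np

def kmeans(x, k, num_iters=20, seed=0):
    """Plain k-means on the rows of x; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize by sampling k distinct input feature vectors.
    centers = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    for _ in range(num_iters):
        # Squared distance from every sample to every center, shape (n, k).
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave empty clusters where they are
                centers[j] = x[labels == j].mean(axis=0)
    return centers, labels

# Toy feature vectors forming two well-separated groups.
x = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
centers, labels = kmeans(x, k=2)
```

For image segmentation, each row of `x` would hold a pixel's feature vector, e.g., its color and possibly its position.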

In mixtures of Gaussians, each cluster center is augmented by a covariance matrix whose 
values are re-estimated from the corresponding samples. Instead of using nearest neighbors 
to associate input samples with cluster centers, a Mahalanobis distance (Appendix B.1.1) is 
used: 

$$ d(x_i, \mu_k; \Sigma_k) = \|x_i - \mu_k\|_{\Sigma_k^{-1}} = (x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k), \tag{5.26} $$

where $x_i$ are the input samples, $\mu_k$ are the cluster centers, and $\Sigma_k$ are their covariance
estimates. Samples can be associated with the nearest cluster center (a hard assignment of
membership) or can be softly assigned to several nearby clusters. 

This latter, more commonly used, approach corresponds to iteratively re-estimating the 
parameters for a mixture of Gaussians density function, 

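$$ p(x | \{\pi_k, \mu_k, \Sigma_k\}) = \sum_k \pi_k \, \mathcal{N}(x | \mu_k, \Sigma_k), \tag{5.27} $$

where $\pi_k$ are the mixing coefficients, $\mu_k$ and $\Sigma_k$ are the Gaussian means and covariances,
and

$$ \mathcal{N}(x | \mu_k, \Sigma_k) = \frac{1}{|\Sigma_k|} e^{-d(x, \mu_k; \Sigma_k)} \tag{5.28} $$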

is the normal (Gaussian) distribution (Bishop 2006). 

To iteratively compute a (local) maximum likelihood estimate for the unknown mixture
parameters $\{\pi_k, \mu_k, \Sigma_k\}$, the expectation maximization (EM) algorithm (Dempster, Laird, and
Rubin 1977) proceeds in two alternating stages:

1. The expectation stage (E step) estimates the responsibilities

$$ z_{ik} = \frac{1}{Z_i} \pi_k \, \mathcal{N}(x_i | \mu_k, \Sigma_k) \quad \text{with} \quad \sum_k z_{ik} = 1, \tag{5.29} $$

which are the estimates of how likely a sample $x_i$ was generated from the $k$th Gaussian
cluster.

2. The maximization stage (M step) updates the parameter values

$$ \mu_k = \frac{1}{N_k} \sum_i z_{ik} x_i, \tag{5.30} $$

$$ \Sigma_k = \frac{1}{N_k} \sum_i z_{ik} (x_i - \mu_k)(x_i - \mu_k)^T, \tag{5.31} $$

$$ \pi_k = \frac{N_k}{N}, \tag{5.32} $$

where

$$ N_k = \sum_i z_{ik} \tag{5.33} $$

is an estimate of the number of sample points assigned to each cluster. 
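
The two alternating stages can be sketched in a few lines of NumPy. This is an illustrative sketch only: it assumes the standard normalized multivariate Gaussian density, caller-supplied initial means, a shared initial covariance taken from the data, and a small regularizer `reg` to keep covariances invertible. The names `gaussian_pdf` and `em_gmm` are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Standard multivariate normal density evaluated at each row of x."""
    d = x.shape[1]
    diff = x - mean
    maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm

def em_gmm(x, means, num_iters=100, reg=1e-6):
    """EM for a mixture of Gaussians; returns (pis, means, covs)."""
    n, d = x.shape
    k = len(means)
    means = np.array(means, dtype=float)
    covs = np.array([np.cov(x.T).reshape(d, d) + reg * np.eye(d)] * k)
    pis = np.full(k, 1.0 / k)
    for _ in range(num_iters):
        # E step: responsibilities z_ik, normalized over k as in (5.29).
        z = np.stack([pis[j] * gaussian_pdf(x, means[j], covs[j]) for j in range(k)], axis=1)
        z /= z.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters as in (5.30)-(5.33).
        n_k = z.sum(axis=0)
        means = (z.T @ x) / n_k[:, None]
        for j in range(k):
            diff = x - means[j]
            covs[j] = (z[:, j, None] * diff).T @ diff / n_k[j] + reg * np.eye(d)
        pis = n_k / n
    return pis, means, covs

# Toy 1D data drawn from two Gaussians centered at -5 and +5.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-5, 1, (200, 1)), rng.normal(5, 1, (200, 1))])
pis, means, covs = em_gmm(x, means=[[-3.0], [3.0]])
```

A production implementation would typically work with log-probabilities to avoid numerical underflow and would monitor the log-likelihood to decide when to stop.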

Bishop (2006) has a wonderful exposition of both mixture of Gaussians estimation and the 
more general topic of expectation maximization. 

In the context of image segmentation, Ma, Derksen, Hong et al. (2007) present a nice 
review of segmentation using mixtures of Gaussians and develop their own extension based 
on Minimum Description Length (MDL) coding, which they show produces good results on 
the Berkeley segmentation database. 

5.3.2 Mean shift 

While k-means and mixtures of Gaussians use a parametric form to model the probability den- 
sity function being segmented, mean shift implicitly models this distribution using a smooth 
continuous non-parametric model. The key to mean shift is a technique for efficiently find- 
ing peaks in this high-dimensional data distribution without ever computing the complete 
function explicitly (Fukunaga and Hostetler 1975; Cheng 1995; Comaniciu and Meer 2002). 

Consider once again the data points shown in Figure 5.16c, which can be thought of as 
having been drawn from some probability density function. If we could compute this density 
function, as visualized in Figure 5.16e, we could find its major peaks (modes) and identify
regions of the input space that climb to the same peak as being part of the same region. This 
is the inverse of the watershed algorithm described in Section 5.2.1, which climbs downhill 
to find basins of attraction. 

The first question, then, is how to estimate the density function given a sparse set of 
samples. One of the simplest approaches is to just smooth the data, e.g., by convolving it 
with a fixed kernel of width h, 

$$ f(x) = \sum_i K(x - x_i) = \sum_i k\left( \frac{\|x - x_i\|^2}{h^2} \right), \tag{5.34} $$

where $x_i$ are the input samples and k(r) is the kernel function (or Parzen window). 9 This
approach is known as kernel density estimation or the Parzen window technique (Duda, Hart, 
and Stork 2001, Section 4.3; Bishop 2006, Section 2.5.1). Once we have computed f(x), as 
shown in Figures 5.16e and 5.17, we can find its local maxima using gradient ascent or some 
other optimization technique. 

9 In this simplified formula, a Euclidean metric is used. We discuss a little later (5.42) how to generalize this
to non-uniform (scaled or oriented) metrics. Note also that this distribution may not be proper, i.e., integrate to 1.
Since we are looking for maxima in the density, this does not matter.

Figure 5.17 One-dimensional visualization of the kernel density estimate, its derivative, and
a mean shift. The kernel density estimate f(x) is obtained by convolving the sparse set of
input samples with the kernel function K(x). The derivative of this function, f'(x), can
be obtained by convolving the inputs with the derivative kernel G(x). Estimating the local
displacement vectors around a current estimate $x_k$ results in the mean-shift vector $m(x_k)$,
which, in a multi-dimensional setting, points in the same direction as the function gradient
$\nabla f(x_k)$. The red dots indicate local maxima in f(x) to which the mean shifts converge.


The problem with this “brute force” approach is that, for higher dimensions, it becomes 
computationally prohibitive to evaluate f(x) over the complete search space. 10 Instead, mean 
shift uses a variant of what is known in the optimization literature as multiple restart gradient 
descent. Starting at some guess for a local maximum, $y_k$, which can be a random input data
point $x_i$, mean shift computes the gradient of the density estimate f(x) at $y_k$ and takes an
uphill step in that direction (Figure 5.17). The gradient of f(x) is given by 

$$ \nabla f(x) = \sum_i (x_i - x) G(x - x_i) = \sum_i (x_i - x) \, g\left( \frac{\|x - x_i\|^2}{h^2} \right), \tag{5.35} $$


where 

$$ g(r) = -k'(r), \tag{5.36} $$

and k'(r) is the first derivative of k(r). We can re-write the gradient of the density function 
as 


$$ \nabla f(x) = \left[ \sum_i G(x - x_i) \right] m(x), \tag{5.37} $$

where the vector

$$ m(x) = \frac{\sum_i x_i G(x - x_i)}{\sum_i G(x - x_i)} - x \tag{5.38} $$

is called the mean shift, since it is the difference between the weighted mean of the neighbors
$x_i$ around x and the current value of x.


10 Even for one dimension, if the space is extremely sparse, it may be inefficient.

In the mean-shift procedure, the current estimate of the mode $y_k$ at iteration k is replaced
by its locally weighted mean,

$$ y_{k+1} = y_k + m(y_k) = \frac{\sum_i x_i G(y_k - x_i)}{\sum_i G(y_k - x_i)}. \tag{5.39} $$

Comaniciu and Meer (2002) prove that this algorithm converges to a local maximum of f(x)
under reasonably weak conditions on the kernel k(r), i.e., that it is monotonically decreasing.
This convergence is not guaranteed for regular gradient descent unless appropriate step size
control is used.

The two kernels that Comaniciu and Meer (2002) studied are the Epanechnikov kernel,

$$ k_E(r) = \max(0, 1 - r), \tag{5.40} $$

which is a radial generalization of a bilinear kernel, and the Gaussian (normal) kernel,

$$ k_N(r) = \exp\left( -\frac{1}{2} r \right). \tag{5.41} $$

The corresponding derivative kernels g(r) are a unit ball and another Gaussian, respectively.
Using the Epanechnikov kernel converges in a finite number of steps, while the Gaussian
kernel has a smoother trajectory (and produces better results), but converges very slowly near
a mode (Exercise 5.5).

The simplest way to apply mean shift is to start a separate mean-shift mode estimate
y at every input point $x_i$ and to iterate for a fixed number of steps or until the mean-shift
magnitude is below a threshold. A faster approach is to randomly subsample the input points
$x_i$ and to keep track of each point’s temporal evolution. The remaining points can then be
classified based on the nearest evolution path (Comaniciu and Meer 2002). Paris and Durand
(2007) review a number of other more efficient implementations of mean shift, including their
own approach, which is based on using an efficient low-resolution estimate of the complete
multi-dimensional space of f(x) along with its stationary points.

The color-based segmentation shown in Figure 5.16 only looks at pixel colors when
determining the best clustering. It may therefore cluster together small isolated pixels that happen
to have the same color, which may not correspond to a semantically meaningful segmentation
of the image.

Better results can usually be obtained by clustering in the joint domain of color and
location. In this approach, the spatial coordinates of the image $x_s = (x, y)$, which are called
the spatial domain, are concatenated with the color values $x_r$, which are known as the range
domain, and mean-shift clustering is applied in this five-dimensional space $x_j$. Since location
and color may have different scales, the kernels are adjusted accordingly, i.e., we use a kernel
of the form

$$ K(x_j) = k\left( \frac{\|x_r\|^2}{h_r^2} \right) k\left( \frac{\|x_s\|^2}{h_s^2} \right), \tag{5.42} $$





Figure 5.18 Mean-shift color image segmentation with parameters $(h_s, h_r, M) = (16, 19, 40)$
(Comaniciu and Meer 2002) © 2002 IEEE.


where separate parameters $h_s$ and $h_r$ are used to control the spatial and range bandwidths of
the filter kernels. Figure 5.18 shows an example of mean-shift clustering in the joint domain,
with parameters $(h_s, h_r, M) = (16, 19, 40)$, where spatial regions containing less than M
pixels are eliminated. 

The form of the joint domain filter kernel (5.42) is reminiscent of the bilateral filter kernel 
(3.34-3.37) discussed in Section 3.3.1. The difference between mean shift and bilateral fil- 
tering, however, is that in mean shift the spatial coordinates of each pixel are adjusted along 
with its color values, so that the pixel migrates more quickly towards other pixels with similar 
colors, and can therefore later be used for clustering and segmentation. 
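
The basic mean-shift procedure, in which a mode estimate is repeatedly replaced by the kernel-weighted mean of its neighbors, can be sketched as follows. This is a minimal illustration that assumes a Gaussian kernel with a single bandwidth `h` (rather than separate spatial and range bandwidths) and evaluates all samples at every step; the function name `mean_shift_mode` and the toy data are hypothetical.

```python
import numpy as np

def mean_shift_mode(x, start, h, max_iters=200, tol=1e-6):
    """Climb from `start` to a nearby mode of the kernel density estimate of x."""
    y = np.asarray(start, dtype=float)
    for _ in range(max_iters):
        r = ((x - y) ** 2).sum(axis=1) / h**2   # squared distances scaled by the bandwidth
        g = np.exp(-0.5 * r)                    # derivative kernel of a Gaussian (up to a constant)
        y_next = (g[:, None] * x).sum(axis=0) / g.sum()  # locally weighted mean
        if np.linalg.norm(y_next - y) < tol:    # mean-shift magnitude below threshold
            return y_next
        y = y_next
    return y

# Toy samples: two symmetric clusters centered at (0, 0) and (10, 10).
offsets = np.array([[0, 0], [0.5, 0], [0, 0.5], [-0.5, 0], [0, -0.5]])
x = np.concatenate([offsets, offsets + 10.0])
mode_a = mean_shift_mode(x, start=x[1], h=1.0)
mode_b = mean_shift_mode(x, start=x[6], h=1.0)
```

Samples whose trajectories end at (approximately) the same mode would then be assigned to the same cluster.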

Determining the best bandwidth parameters h to use with mean shift remains something 
of an art, although a number of approaches have been explored. These include optimizing 
the bias-variance tradeoff, looking for parameter ranges where the number of clusters varies 
slowly, optimizing some external clustering criterion, or using top-down (application domain) 
knowledge (Comaniciu and Meer 2003). It is also possible to change the orientation of the 
kernel in joint parameter space for applications such as spatio-temporal (video) segmentations 
(Wang, Thiesson, Xu et al. 2004). 

Mean shift has been applied to a number of different problems in computer vision, includ- 
ing face tracking, 2D shape extraction, and texture segmentation (Comaniciu and Meer 2002), 
and more recently in stereo matching (Chapter 11) (Wei and Quan 2004), non-photorealistic 
rendering (Section 10.5.2) (DeCarlo and Santella 2002), and video editing (Section 10.4.5) 
(Wang, Bhat, Colburn et al. 2005). Paris and Durand (2007) provide a nice review of such 
applications, as well as techniques for more efficiently solving the mean-shift equations and 
producing hierarchical segmentations. 

Figure 5.19 Sample weighted graph and its normalized cut: (a) a small sample graph and
its smallest normalized cut; (b) tabular form of the associations and cuts for this graph. The
assoc and cut entries are computed as area sums of the associated weight matrix W
(Figure 5.20). Normalizing the table entries by the row or column sums produces normalized
associations and cuts Nassoc and Ncut.


5.4 Normalized cuts 


While bottom-up merging techniques aggregate regions into coherent wholes and mean-shift 
techniques try to find clusters of similar pixels using mode finding, the normalized cuts 
technique introduced by Shi and Malik (2000) examines the affinities (similarities) between 
nearby pixels and tries to separate groups that are connected by weak affinities. 

Consider the simple graph shown in Figure 5.19a. The pixels in group A are all strongly 
connected with high affinities, shown as thick red lines, as are the pixels in group B. The 
connections between these two groups, shown as thinner blue lines, are much weaker. A 
normalized cut between the two groups, shown as a dashed line, separates them into two 
clusters. 

The cut between two groups A and B is defined as the sum of all the weights being cut,

$$ cut(A, B) = \sum_{i \in A, j \in B} w_{ij}, \tag{5.43} $$


where the weights between two pixels (or regions) i and j measure their similarity. Using 
a minimum cut as a segmentation criterion, however, does not result in reasonable clusters, 
since the smallest cuts usually involve isolating a single pixel. 

A better measure of segmentation is the normalized cut, which is defined as 


$$ Ncut(A, B) = \frac{cut(A, B)}{assoc(A, V)} + \frac{cut(A, B)}{assoc(B, V)}, \tag{5.44} $$


where $assoc(A, A) = \sum_{i \in A, j \in A} w_{ij}$ is the association (sum of all the weights) within a
cluster and assoc(A, V) = assoc(A, A) + cut(A, B) is the sum of all the weights associated
with nodes in A. Figure 5.19b shows how the cuts and associations can be thought of as area
sums in the weight matrix $W = [w_{ij}]$, where the entries of the matrix have been arranged so
that the nodes in A come first and the nodes in B come second. Figure 5.20 shows an actual 
weight matrix for which these area sums can be computed. Dividing each of these areas by 
the corresponding row sum (the rightmost column of Figure 5.19b) results in the normalized 
cut and association values. These normalized values better reflect the fitness of a particular 
segmentation, since they look for collections of edges that are weak relative to all of the edges 
both inside and emanating from a particular region. 
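
As a small illustration of why the normalization matters, the following sketch evaluates (5.43) and (5.44) on a hypothetical five-node graph. The weight matrix and the two candidate partitions are made up for this example; the function names `cut` and `ncut` are ours.

```python
def cut(W, A, B):
    """Sum of the weights w_ij with i in A and j in B (Equation 5.43)."""
    return sum(W[i][j] for i in A for j in B)


def ncut(W, A, B):
    """Normalized cut (Equation 5.44), using assoc(A, V) = assoc(A, A) + cut(A, B)."""
    c = cut(W, A, B)
    assoc_a = cut(W, A, A) + c  # assoc(A, A) sums w_ij over i, j both in A
    assoc_b = cut(W, B, B) + c
    return c / assoc_a + c / assoc_b


# Hypothetical 5-node chain: {0, 1} and {2, 3} are tight pairs joined by a weak
# link (0.3), and node 4 hangs off node 3 by an even weaker link (0.2).
W = [
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.3, 0.0, 0.0],
    [0.0, 0.3, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.2],
    [0.0, 0.0, 0.0, 0.2, 0.0],
]

isolate = ([4], [0, 1, 2, 3])    # cuts off a single node
balanced = ([0, 1], [2, 3, 4])   # separates the two tight groups

print(cut(W, *isolate), cut(W, *balanced))    # the plain cut favors isolating node 4
print(ncut(W, *isolate), ncut(W, *balanced))  # the normalized cut favors the balanced split
```

Isolating node 4 has the smaller plain cut, but its normalized cut exceeds 1 because the isolated node has no internal association, whereas the balanced split scores well under 1.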

Unfortunately, computing the optimal normalized cut is NP-complete. Instead, Shi and 
Malik (2000) suggest computing a real-valued assignment of nodes to groups. Let x be the 
indicator vector where $x_i = +1$ iff $i \in A$ and $x_i = -1$ iff $i \in B$. Let $d = W\mathbf{1}$ be the row 
sums of the symmetric matrix W and D = diag(d) be the corresponding diagonal matrix. 
Shi and Malik (2000) show that minimizing the normalized cut over all possible indicator 
vectors x is equivalent to minimizing 


$$\min_{y} \frac{y^T (D - W)\, y}{y^T D\, y}, \tag{5.45}$$

where $y = ((\mathbf{1} + x) - b(\mathbf{1} - x))/2$ is a vector consisting of all $1$s and $-b$s such that $y \cdot d = 0$. 
Minimizing this Rayleigh quotient is equivalent to solving the generalized eigenvalue system 


$$(D - W)\, y = \lambda D y, \tag{5.46}$$

which can be turned into a regular eigenvalue problem

$$(I - N)\, z = \lambda z, \tag{5.47}$$

where $N = D^{-1/2} W D^{-1/2}$ is the normalized affinity matrix (Weiss 1999) and $z = D^{1/2} y$.
Because these eigenvectors can be interpreted as the large modes of vibration in 
a spring-mass system, normalized cuts is an example of a spectral method for image segmen- 
tation. 
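
The relaxation in (5.45-5.47) can be tried out on a toy graph. The sketch below is not the sparse eigensolver used by Shi and Malik; as a dependency-free stand-in, it runs power iteration on $(I + N)/2$ after projecting out the trivial eigenvector $D^{1/2}\mathbf{1}$, which converges to the eigenvector with the second smallest eigenvalue of (5.46). The function name `relaxed_ncut`, the iteration count, and the six-node graph (two triangles joined by one weak edge) are assumptions made for this example.

```python
import math


def relaxed_ncut(W, iters=200):
    """Approximate the second smallest generalized eigenvector of (D - W) y = lambda D y.

    Power iteration on (I + N) / 2 with N = D^(-1/2) W D^(-1/2), after projecting
    out the trivial eigenvector D^(1/2) 1. Assumes W is symmetric with positive row sums.
    """
    n = len(W)
    d = [sum(row) for row in W]                      # d = W 1
    s = [math.sqrt(di) for di in d]                  # diagonal of D^(1/2)
    total = math.sqrt(sum(d))
    z0 = [si / total for si in s]                    # unit-length trivial eigenvector of N
    z = [float(i + 1) for i in range(n)]             # arbitrary deterministic start
    for _ in range(iters):
        proj = sum(a * b for a, b in zip(z, z0))
        z = [a - proj * b for a, b in zip(z, z0)]    # stay orthogonal to z0
        Nz = [sum(W[i][j] * z[j] / (s[i] * s[j]) for j in range(n)) for i in range(n)]
        z = [0.5 * (a + b) for a, b in zip(z, Nz)]   # apply (I + N) / 2
        norm = math.sqrt(sum(a * a for a in z))
        z = [a / norm for a in z]
    y = [zi / si for zi, si in zip(z, s)]            # y = D^(-1/2) z
    dyy = sum(d[i] * y[i] ** 2 for i in range(n))
    wyy = sum(W[i][j] * y[i] * y[j] for i in range(n) for j in range(n))
    return y, (dyy - wyy) / dyy                      # eigenvector and Rayleigh quotient (5.45)


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by one weak edge.
W = [
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
]
y, lam = relaxed_ncut(W)
groups = [yi > 0 for yi in y]   # split the nodes by the sign of the eigenvector
print(groups, lam)
```

Splitting the nodes by the sign of $y$ separates the two triangles. For real images, $W$ is large and sparse, so a sparse eigensolver is the more practical choice.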

Extending an idea originally proposed by Scott and Longuet-Higgins (1990), Weiss (1999) 
suggests normalizing the affinity matrix and then using the top k eigenvectors to reconstitute a 
Q matrix. Other papers have extended the basic normalized cuts framework by modifying the 
affinity matrix in different ways, finding better discrete solutions to the minimization prob- 
lem, or applying multi-scale techniques (Meila and Shi 2000, 2001; Ng, Jordan, and Weiss 
2001; Yu and Shi 2003; Cour, Benezit, and Shi 2005; Tolliver and Miller 2006). 

Figure 5.20b shows the second smallest (real-valued) eigenvector corresponding to the 
weight matrix shown in Figure 5.20a. (Here, the rows have been permuted to separate the 
two groups of variables that belong to the different components of this eigenvector.) After
this real-valued vector is computed, the variables corresponding to positive and negative
eigenvector values are associated with the two cut components. This process can be further
repeated to hierarchically subdivide an image, as shown in Figure 5.21.

Figure 5.20 Sample weight table and its second smallest eigenvector (Shi and Malik 2000)
© 2000 IEEE: (a) sample 32 × 32 weight matrix $W$; (b) eigenvector corresponding to the
second smallest eigenvalue of the generalized eigenvalue problem $(D - W)y = \lambda D y$.

The original algorithm proposed by Shi and Malik (2000) used spatial position and image
feature differences to compute the pixel-wise affinities,

$$w_{ij} = \exp\left(-\frac{\|F_i - F_j\|^2}{\sigma_F^2} - \frac{\|x_i - x_j\|^2}{\sigma_s^2}\right), \tag{5.48}$$

for pixels within a radius $\|x_i - x_j\| < r$, where $F$ is a feature vector that consists of intensities,
colors, or oriented filter histograms. (Note how (5.48) is the negative exponential of the 
joint feature space distance (5.42).) 
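
A minimal sketch of (5.48) on a four-pixel, one-dimensional "image" follows. The function name `affinity`, the bandwidths `sigma_f` and `sigma_s`, the radius `r`, and the pixel values are all hypothetical choices for illustration.

```python
import math


def affinity(F, X, i, j, sigma_f, sigma_s, r):
    """w_ij from Equation (5.48); pairs at distance r or more get zero affinity."""
    dx2 = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
    if dx2 >= r * r:
        return 0.0
    df2 = sum((a - b) ** 2 for a, b in zip(F[i], F[j]))
    return math.exp(-df2 / sigma_f ** 2 - dx2 / sigma_s ** 2)


# Hypothetical 1D "image": two dark pixels followed by two bright ones.
F = [(0.10,), (0.12,), (0.90,), (0.95,)]   # feature vectors (here, intensity only)
X = [(0.0,), (1.0,), (2.0,), (3.0,)]       # pixel positions
n = len(F)
W = [[affinity(F, X, i, j, sigma_f=0.2, sigma_s=4.0, r=2.5) for j in range(n)]
     for i in range(n)]

print(W[0][1], W[1][2], W[0][3])   # similar neighbors, dissimilar neighbors, out of range
```

Neighboring pixels with similar intensity receive an affinity close to 1, the pair straddling the intensity edge receives a value near 0, and pixels farther apart than `r` are not connected at all, which keeps $W$ sparse.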

In subsequent work, Malik, Belongie, Leung et al. (2001) look for intervening contours
between pixels i and j and define an intervening contour weight

$$w^{IC}_{ij} = 1 - \max_{x \in l_{ij}} p_{con}(x), \tag{5.49}$$

where $l_{ij}$ is the image line joining pixels $i$ and $j$ and $p_{con}(x)$ is the probability of an intervening
contour perpendicular to this line, which is defined as the negative exponential of the 
oriented energy in the perpendicular direction. They multiply these weights with a texton- 
based texture similarity metric and use an initial over-segmentation based purely on local 
pixel-wise features to re-estimate intervening contours and texture statistics in a region-based 
manner. Figure 5.22 shows the results of running this improved algorithm on a number of 
test images. 
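
The intervening contour weight (5.49) is simple to sketch once a contour probability map is available. In the following example, the map `p_con` is a hypothetical hand-made grid (computing it from oriented energy is not shown), and the line between the two pixels is sampled with nearest-neighbor rounding, which is an assumption of this sketch rather than the authors' procedure.

```python
def intervening_contour_weight(p_con, i, j, samples=16):
    """1 - max of p_con sampled along the line from pixel i to pixel j (Equation 5.49)."""
    (r0, c0), (r1, c1) = i, j
    strongest = 0.0
    for k in range(samples + 1):
        t = k / samples
        r = round(r0 + t * (r1 - r0))    # nearest-neighbor sample on the line
        c = round(c0 + t * (c1 - c0))
        strongest = max(strongest, p_con[r][c])
    return 1.0 - strongest


# Hypothetical contour probability map with a vertical contour in column 2.
p_con = [
    [0.0, 0.0, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.9, 0.0, 0.0],
]
w_same = intervening_contour_weight(p_con, (1, 0), (1, 1))     # no contour in between
w_across = intervening_contour_weight(p_con, (1, 0), (1, 4))   # crosses the contour
print(w_same, w_across)
```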

Because it requires the solution of large sparse eigenvalue problems, normalized cuts can
be quite slow. Sharon, Galun, Sharon et al. (2006) present a way to accelerate the computation
of the normalized cuts using an approach inspired by algebraic multigrid (Brandt
1986; Briggs, Henson, and McCormick 2000). To coarsen the original problem, they select
a smaller number of variables such that the remaining fine-level variables are strongly coupled
to at least one coarse-level variable. Figure 5.15 shows this process schematically, while
(5.25) gives the definition for strong coupling except that, in this case, the original weights
$w_{ij}$ in the normalized cut are used instead of merge probabilities $p_{ij}$.

Figure 5.21 Normalized cuts segmentation (Shi and Malik 2000) © 2000 IEEE: The input
image and the components returned by the normalized cuts algorithm.

Figure 5.22 Comparative segmentation results (Alpert, Galun, Basri et al. 2007) © 2007
IEEE. “Our method” refers to the probabilistic bottom-up merging algorithm developed by
Alpert et al.

Once a set of coarse variables has been selected, an inter-level interpolation matrix with 
elements similar to the left hand side of (5.25) is used to define a reduced version of the nor- 
malized cuts problem. In addition to computing the weight matrix using interpolation-based 
coarsening, additional region statistics are used to modulate the weights. After a normalized 
cut has been computed at the coarsest level of analysis, the membership values of finer-level 
nodes are computed by interpolating parent values and mapping values within $\epsilon = 0.1$ of 0 
and 1 to pure Boolean values. 

An example of the segmentation produced by weighted aggregation (SWA) is shown in 
Figure 5.22, along with the most recent probabilistic bottom-up merging algorithm by Alpert, 
Galun, Basri et al. (2007), which was described in Section 5.2. In even more recent work, 
Wang and Oliensis (2010) show how to estimate statistics over segmentations (e.g., mean 
region size) directly from the affinity graph. They use this to produce segmentations that are 
more central with respect to other possible segmentations. 


5.5 Graph cuts and energy-based methods 

A common theme in image segmentation algorithms is the desire to group pixels that have 
similar appearance (statistics) and to have the boundaries between pixels in different regions 
be of short length and across visible discontinuities. If we restrict the boundary measurements 
to be between immediate neighbors and compute region membership statistics by summing 
over pixels, we can formulate this as a classic pixel-based energy function using either a 
variational formulation (regularization, see Section 3.7.1) or as a binary Markov random 
field (Section 3.7.2). 

Examples of the continuous approach include (Mumford and Shah 1989; Chan and Vese 
1992; Zhu and Yuille 1996; Tabb and Ahuja 1997) along with the level set approaches dis- 
cussed in Section 5.1.4. An early example of a discrete labeling problem that combines 
both region-based and boundary-based energy terms is the work of Leclerc (1989), who used 
minimum description length (MDL) coding to derive the energy function being minimized. 
Boykov and Funka-Lea (2006) present a wonderful survey of various energy-based tech- 
niques for binary object segmentation, some of which we discuss below. 

As we saw in Section 3.7.2, the energy corresponding to a segmentation problem can be
written (cf. Equations (3.100) and (3.108-3.113)) as

$$E(f) = \sum_{i,j} E_r(i,j) + E_b(i,j), \tag{5.50}$$

where the region term

$$E_r(i,j) = E_S(I(i,j); R(f(i,j))) \tag{5.51}$$

is the negative log likelihood that pixel intensity (or color) $I(i,j)$ is consistent with the
statistics of region $R(f(i,j))$ and the boundary term

$$E_b(i,j) = s_x(i,j)\,\delta(f(i,j) - f(i+1,j)) + s_y(i,j)\,\delta(f(i,j) - f(i,j+1)) \tag{5.52}$$

measures the inconsistency between $\mathcal{N}_4$ neighbors modulated by local horizontal and vertical
smoothness terms $s_x(i,j)$ and $s_y(i,j)$.

Region statistics can be something as simple as the mean gray level or color (Leclerc
1989), in which case

$$E_S(I; \mu_k) = \|I - \mu_k\|^2. \tag{5.53}$$

Alternatively, they can be more complex, such as region intensity histograms (Boykov and
Jolly 2001) or color Gaussian mixture models (Rother, Kolmogorov, and Blake 2004). For
smoothness (boundary) terms, it is common to make the strength of the smoothness $s_x(i,j)$
inversely proportional to the local edge strength (Boykov, Veksler, and Zabih 2001).

Originally, energy-based segmentation problems were optimized using iterative gradient
descent techniques, which were slow and prone to getting trapped in local minima. Boykov
and Jolly (2001) were the first to apply the binary MRF optimization algorithm developed by
Greig, Porteous, and Seheult (1989) to binary object segmentation.

In this approach, the user first delineates pixels in the background and foreground regions
using a few strokes of an image brush (Figure 3.61). These pixels then become the seeds that
tie nodes in the S-T graph to the source and sink labels S and T (Figure 5.23a). Seed pixels
can also be used to estimate foreground and background region statistics (intensity or color
histograms).

The capacities of the other edges in the graph are derived from the region and boundary
energy terms, i.e., pixels that are more compatible with the foreground or background region
get stronger connections to the respective source or sink; adjacent pixels with greater
smoothness also get stronger links. Once the minimum-cut/maximum-flow problem has been solved
using a polynomial time algorithm (Goldberg and Tarjan 1988; Boykov and Kolmogorov
2004), pixels on either side of the computed cut are labeled according to the source or sink to
which they remain connected (Figure 5.23b). While graph cuts is just one of several known
techniques for MRF energy minimization (Appendix B.5.4), it is still the one most commonly
used for solving binary MRF problems.

Figure 5.23 Graph cuts for region segmentation (Boykov and Jolly 2001) © 2001 IEEE: (a) 
the energy function is encoded as a maximum flow problem; (b) the minimum cut determines 
the region boundary. 
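
To make the graph construction concrete, here is a minimal sketch on a five-pixel, one-dimensional image. It is not the Boykov and Kolmogorov (2004) solver; a plain Edmonds-Karp maximum flow is swapped in because it fits in a few lines. The region term is the squared difference from a region mean (5.53), and a constant boundary cost `smooth` is paid wherever neighboring labels differ. The function names, the means, and the image values are assumptions made for this example, and no seed pixels are used.

```python
from collections import deque


def min_cut(cap, s, t):
    """Edmonds-Karp max flow. Returns (cut value, nodes still reachable from s)."""
    res = {u: dict(nbrs) for u, nbrs in cap.items()}   # residual capacities
    for u, nbrs in cap.items():
        for v in nbrs:
            res.setdefault(v, {}).setdefault(u, 0.0)   # make sure reverse edges exist
    flow = 0.0
    while True:
        parent = {s: None}                             # breadth-first search for a path
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow, set(parent)                   # no path left: the cut is found
        bottleneck, v = float("inf"), t
        while parent[v] is not None:
            bottleneck = min(bottleneck, res[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:                   # push flow along the path
            u = parent[v]
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
            v = u
        flow += bottleneck


def segment(I, mu_fg, mu_bg, smooth):
    """Binary labeling of a 1D image; returns (labels, minimum energy)."""
    cap = {"S": {}}
    for p, value in enumerate(I):
        cap["S"][p] = (value - mu_bg) ** 2     # cut (paid) when p is labeled background
        cap[p] = {"T": (value - mu_fg) ** 2}   # cut (paid) when p is labeled foreground
    for p in range(len(I) - 1):                # boundary term between neighbors
        cap[p][p + 1] = smooth
        cap[p + 1][p] = smooth
    energy, source_side = min_cut(cap, "S", "T")
    return [1 if p in source_side else 0 for p in range(len(I))], energy


labels, energy = segment([0.1, 0.2, 0.6, 0.9, 0.8], mu_fg=0.9, mu_bg=0.1, smooth=0.1)
print(labels, energy)
```

Pixels that stay connected to the source S are labeled foreground. Seed strokes could be added by giving the corresponding source or sink link a very large capacity.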


Figure 5.24 GrabCut image segmentation (Rother, Kolmogorov, and Blake 2004) © 2004 
ACM: (a) the user draws a bounding box in red; (b) the algorithm guesses color distributions 
for the object and background and performs a binary segmentation; (c) the process is repeated 
with better region statistics. 

The basic binary segmentation algorithm of Boykov and Jolly (2001) has been extended 
in a number of directions. The GrabCut system of Rother, Kolmogorov, and Blake (2004) 
iteratively re-estimates the region statistics, which are modeled as mixtures of Gaussians in 
color space. This allows their system to operate given minimal user input, such as a single 
bounding box (Figure 5.24a) — the background color model is initialized from a strip of pixels 
around the box outline. (The foreground color model is initialized from the interior pixels, 
but quickly converges to a better estimate of the object.) The user can also place additional 
strokes to refine the segmentation as the solution progresses. In more recent work, Cui, Yang, 
Wen et al. (2008) use color and edge models derived from previous segmentations of similar 
objects to improve the local models used in GrabCut. 

Figure 5.25 Segmentation with a directed graph cut (Boykov and Funka-Lea 2006) © 2006
Springer: (a) directed graph; (b) image with seed points; (c) the undirected graph incorrectly
continues the boundary along the bright object; (d) the directed graph correctly segments the
light gray region from its darker surround.

Another major extension to the original binary segmentation formulation is the addition of
directed edges, which allows boundary regions to be oriented, e.g., to prefer light to dark
transitions or vice versa (Kolmogorov and Boykov 2005). Figure 5.25 shows an example where 
the directed graph cut correctly segments the light gray liver from its dark gray surround. The 
same approach can be used to measure the flux exiting a region, i.e., the signed gradient pro- 
jected normal to the region boundary. Combining oriented graphs with larger neighborhoods 
enables approximating continuous problems such as those traditionally solved using level sets 
in the globally optimal graph cut framework (Boykov and Kolmogorov 2003; Kolmogorov 
and Boykov 2005). 

Even more recent developments in graph cut-based segmentation techniques include the 
addition of connectivity priors to force the foreground to be in a single piece (Vicente, Kol- 
mogorov, and Rother 2008) and shape priors to use knowledge about an object’s shape during 
the segmentation process (Lempitsky and Boykov 2007; Lempitsky, Blake, and Rother 2008). 

While optimizing the binary MRF energy (5.50) requires the use of combinatorial op- 
timization techniques, such as maximum flow, an approximate solution can be obtained by 
converting the binary energy terms into quadratic energy terms defined over a continuous 
[0, 1] random field, which then becomes a classical membrane-based regularization problem 
(3.100-3.102). The resulting quadratic energy function can then be solved using standard 
linear system solvers (3.102-3.103), although if speed is an issue, you should use multigrid 
or one of its variants (Appendix A.5). Once the continuous solution has been computed, it 
can be thresholded at 0.5 to yield a binary segmentation. 

The [0, 1] continuous optimization problem can also be interpreted as computing the probability
at each pixel that a random walker starting at that pixel ends up at one of the labeled
seed pixels, which is also equivalent to computing the potential in a resistive grid where the
conductances of the resistors are equal to the edge weights (Grady 2006; Sinop and Grady 2007).
K-way segmentations can also be computed by iterating through the seed labels, using a binary problem
with one label set to 1 and all the others set to 0 to compute the relative membership
probabilities for each pixel. In follow-on work, Grady and Ali (2008) use a precomputation of the
eigenvectors of the linear system to make the solution with a novel set of seeds faster, which
is related to the Laplacian matting problem presented in Section 10.4.3 (Levin, Acha, and
Lischinski 2008). Couprie, Grady, Najman et al. (2009) relate the random walker to watersheds
and other segmentation techniques. Singaraju, Grady, and Vidal (2008) add directed-edge
constraints in order to support flux, which makes the energy piecewise quadratic and
hence not solvable as a single linear system. The random walker algorithm can also be used
to solve the Mumford-Shah segmentation problem (Grady and Alvino 2008) and to compute
fast multigrid solutions (Grady 2008). A nice review of these techniques is given by
Singaraju, Grady, Sinop et al. (2010).
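
The seeded continuous problem can be sketched on a one-dimensional chain of pixels, where each unseeded value must equal the weighted average of its neighbors. The example below uses simple Gauss-Seidel relaxation as a stand-in for the linear solvers mentioned above. The edge weights $\exp(-\beta\,\Delta I^2)$, the value of `beta`, and the function name are assumptions made for illustration.

```python
import math


def random_walker_1d(weights, seeds, sweeps=500):
    """Gauss-Seidel solve for the seeded potentials on a 1D chain.

    weights[i] couples pixels i and i + 1; seeds maps a pixel index to 0.0 or 1.0.
    """
    n = len(weights) + 1
    x = [seeds.get(i, 0.5) for i in range(n)]
    for _ in range(sweeps):
        for i in range(n):
            if i in seeds:
                continue                     # seed values stay clamped
            num = den = 0.0
            if i > 0:
                num += weights[i - 1] * x[i - 1]
                den += weights[i - 1]
            if i < n - 1:
                num += weights[i] * x[i + 1]
                den += weights[i]
            x[i] = num / den                 # weighted average of the neighbors
    return x


I = [0.10, 0.10, 0.15, 0.80, 0.85, 0.90]
beta = 10.0                                  # hypothetical contrast parameter
weights = [math.exp(-beta * (a - b) ** 2) for a, b in zip(I, I[1:])]
x = random_walker_1d(weights, seeds={0: 0.0, 5: 1.0})
labels = [1 if xi > 0.5 else 0 for xi in x]  # threshold at 0.5
print(x, labels)
```

Almost all of the potential drop occurs across the weak edge between the dark and bright pixels, so thresholding at 0.5 places the boundary there.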

An even faster way to compute a continuous [0, 1] approximate segmentation is to com- 
pute weighted geodesic distances between the 0 and 1 seed regions (Bai and Sapiro 2009), 
which can also be used to estimate soft alpha mattes (Section 10.4.3). A related approach by 
Criminisi, Sharp, and Blake (2008) can be used to find fast approximate solutions to general 
binary Markov random field optimization problems. 
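
The geodesic idea can be sketched with Dijkstra's algorithm on the pixel grid. In this simplified version, the cost of stepping between 4-connected neighbors is their absolute intensity difference, which is our assumption for illustration and not the exact weighting used by Bai and Sapiro (2009). Each pixel is then assigned to whichever seed set is geodesically closer.

```python
import heapq


def geodesic_distance(image, seeds):
    """Dijkstra over the 4-connected grid; a step from p to q costs |I_p - I_q|."""
    h, w = len(image), len(image[0])
    dist = [[float("inf")] * w for _ in range(h)]
    heap = []
    for r, c in seeds:
        dist[r][c] = 0.0
        heapq.heappush(heap, (0.0, r, c))
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > dist[r][c]:
            continue                                   # stale queue entry
        for rr, cc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= rr < h and 0 <= cc < w:
                nd = d + abs(image[r][c] - image[rr][cc])
                if nd < dist[rr][cc]:
                    dist[rr][cc] = nd
                    heapq.heappush(heap, (nd, rr, cc))
    return dist


# Hypothetical 2 x 5 image with a dark region on the left and a bright one on the right.
image = [
    [0.1, 0.1, 0.2, 0.8, 0.9],
    [0.1, 0.2, 0.2, 0.9, 0.9],
]
d0 = geodesic_distance(image, [(0, 0)])    # distance to the "0" seed
d1 = geodesic_distance(image, [(1, 4)])    # distance to the "1" seed
labels = [[1 if d1[r][c] < d0[r][c] else 0 for c in range(5)] for r in range(2)]
print(labels)
```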

5.5.1 Application: Medical image segmentation 

One of the most promising applications of image segmentation is in the medical imaging 
domain, where it can be used to segment anatomical tissues for later quantitative analysis. 
Figure 5.25 shows a binary graph cut with directed edges being used to segment the liver tis- 
sue (light gray) from its surrounding bone (white) and muscle (dark gray) tissue. Figure 5.26 
shows the segmentation of bones in a 256 x 256 x 119 computed X-ray tomography (CT) 
volume. Without the powerful optimization techniques available in today’s image segmen- 
tation algorithms, such processing used to require much more laborious manual tracing of 
individual X-ray slices. 

The fields of medical image segmentation (McInerney and Terzopoulos 1996) and medical
image registration (Kybic and Unser 2003) (Section 8.3.1) are rich research fields with
their own specialized conferences, such as Medical Image Computing and Computer Assisted
Intervention (MICCAI), 11 and journals, such as Medical Image Analysis and IEEE
Transactions on Medical Imaging. These can be great sources of references and ideas for
research in this area.

11 http://www.miccai.org/.


Figure 5.26 3D volumetric medical image segmentation using graph cuts (Boykov and 
Funka-Lea 2006) © 2006 Springer: (a) computed tomography (CT) slice with some seeds; 
(b) recovered 3D volumetric bone model (on a 256 x 256 x 119 voxel grid). 

5.6 Additional reading 

The topic of image segmentation is closely related to clustering techniques, which are treated 
in a number of monographs and review articles (Jain and Dubes 1988; Kaufman and Rousseeuw 
1990; Jain, Duin, and Mao 2000; Jain, Topchy, Law et al. 2004). Some early segmentation 
techniques include those described by Brice and Fennema (1970); Pavlidis (1977); Riseman 
and Arbib (1977); Ohlander, Price, and Reddy (1978); Rosenfeld and Davis (1979); Haralick 
and Shapiro (1985), while examples of newer techniques are developed by Leclerc (1989); 
Mumford and Shah (1989); Shi and Malik (2000); Felzenszwalb and Huttenlocher (2004b). 

Arbelaez, Maire, Fowlkes et al. (2010) provide a good review of automatic segmentation 
techniques and also compare their performance on the Berkeley Segmentation Dataset and 
Benchmark (Martin, Fowlkes, Tal et al. 2001). 12 Additional comparison papers and databases 
include those by Unnikrishnan, Pantofaru, and Hebert (2007); Alpert, Galun, Basri et al. 
(2007); Estrada and Jepson (2009). 

The topic of active contours has a long history, beginning with the seminal work on 
snakes and other energy-minimizing variational methods (Kass, Witkin, and Terzopoulos 
1988; Cootes, Cooper, Taylor et al. 1995; Blake and Isard 1998), continuing through tech- 
niques such as intelligent scissors (Mortensen and Barrett 1995, 1999; Perez, Blake, and 
Gangnet 2001), and culminating in level sets (Malladi, Sethian, and Vemuri 1995; Caselles, 
Kimmel, and Sapiro 1997; Sethian 1999; Paragios and Deriche 2000; Sapiro 2001; Osher and 
Paragios 2003; Paragios, Faugeras, Chan et al. 2005; Cremers, Rousson, and Deriche 2007; 
Rousson and Paragios 2008; Paragios and Sgallari 2009), which are currently the most widely 

12 http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/segbench/. 
used active contour methods. 

Techniques for segmenting images based on local pixel similarities combined with ag- 
gregation or splitting methods include watersheds (Vincent and Soille 1991; Beare 2006; 
Arbelaez, Maire, Fowlkes et al. 2010), region splitting (Ohlander, Price, and Reddy 1978), 
region merging (Brice and Fennema 1970; Pavlidis and Liow 1990; Jain, Topchy, Law et al.
2004), as well as graph-based and probabilistic multi-scale approaches (Felzenszwalb and
Huttenlocher 2004b; Alpert, Galun, Basri et al. 2007).

Mean-shift algorithms, which find modes (peaks) in a density function representation of 
the pixels, are presented by Comaniciu and Meer (2002); Paris and Durand (2007). Parametric 
mixtures of Gaussians can also be used to represent and segment such pixel densities (Bishop 
2006; Ma, Derksen, Hong et al. 2007). 

The seminal work on spectral (eigenvalue) methods for image segmentation is the nor- 
malized cut algorithm of Shi and Malik (2000). Related work includes that by Weiss (1999); 
Meila and Shi (2000, 2001); Malik, Belongie, Leung et al. (2001); Ng, Jordan, and Weiss 
(2001); Yu and Shi (2003); Cour, Benezit, and Shi (2005); Sharon, Galun, Sharon et al. 
(2006); Tolliver and Miller (2006); Wang and Oliensis (2010). 

Continuous-energy-based (variational) approaches to interactive segmentation include Leclerc 
(1989); Mumford and Shah (1989); Chan and Vese (1992); Zhu and Yuille (1996); Tabb and 
Ahuja (1997). Discrete variants of such problems are usually optimized using binary graph 
cuts or other combinatorial energy minimization methods (Boykov and Jolly 2001; Boykov 
and Kolmogorov 2003; Rother, Kolmogorov, and Blake 2004; Kolmogorov and Boykov 2005;
Cui, Yang, Wen et al. 2008; Vicente, Kolmogorov, and Rother 2008; Lempitsky and Boykov
2007; Lempitsky, Blake, and Rother 2008), although continuous optimization techniques fol- 
lowed by thresholding can also be used (Grady 2006; Grady and Ali 2008; Singaraju, Grady, 
and Vidal 2008; Criminisi, Sharp, and Blake 2008; Grady 2008; Bai and Sapiro 2009; Cou- 
prie, Grady, Najman et al. 2009). Boykov and Funka-Lea (2006) present a good survey of 
various energy-based techniques for binary object segmentation. 


5.7 Exercises 

Ex 5.1: Snake evolution Prove that, in the absence of external forces, a snake will always 
shrink to a small circle and eventually a single point, regardless of whether first- or second- 
order smoothness (or some combination) is used. 

(Hint: If you can show that the evolution of the x(s) and y(s) components are indepen- 
dent, you can analyze the 1D case more easily.) 


Ex 5.2: Snake tracker Implement a snake-based contour tracker: 
1. Decide whether to use a large number of contour points or a smaller number interpolated
with a B-spline.

2. Define your internal smoothness energy function and decide what image-based attrac- 
tive forces to use. 

3. At each iteration, set up the banded linear system of equations (quadratic energy func- 
tion) and solve it using banded Cholesky factorization (Appendix A.4). 

Ex 5.3: Intelligent scissors Implement the intelligent scissors (live-wire) interactive seg- 
mentation algorithm (Mortensen and Barrett 1995) and design a graphical user interface 
(GUI) to let you draw such curves over an image and use them for segmentation. 

Ex 5.4: Region segmentation Implement one of the region segmentation algorithms de- 
scribed in this chapter. Some popular segmentation algorithms include: 

• k-means (Section 5.3.1); 

• mixtures of Gaussians (Section 5.3.1); 

• mean shift (Section 5.3.2) and Exercise 5.5; 

• normalized cuts (Section 5.4); 

• similarity graph-based segmentation (Section 5.2.4); 

• binary Markov random fields solved using graph cuts (Section 5.5). 

Apply your region segmentation to a video sequence and use it to track moving regions 
from frame to frame. 

Alternatively, test out your segmentation algorithm on the Berkeley segmentation database 
(Martin, Fowlkes, Tal et al. 2001). 

Ex 5.5: Mean shift Develop a mean-shift segmentation algorithm for color images (Co- 
maniciu and Meer 2002). 

1. Convert your image to L*a*b* space, or keep the original RGB colors, and augment
them with the pixel (x, y) locations.

2. For every pixel (L, a, b, x, y), compute the weighted mean of its neighbors using either
a unit ball (Epanechnikov kernel) or finite-radius Gaussian, or some other kernel of
your choosing. Weight the color and spatial scales differently, e.g., using values of
$(h_s, h_r, M) = (16, 19, 40)$ as shown in Figure 5.18.

3. Replace the current value with this weighted mean and iterate until either the motion is 
below a threshold or a finite number of steps has been taken. 

4. Cluster all final values (modes) that are within a threshold, i.e., find the connected 
components. Since each pixel is associated with a final mean-shift (mode) value, this 
results in an image segmentation, i.e., each pixel is labeled with its final component. 

5. (Optional) Use a random subset of the pixels as starting points and find which com- 
ponent each unlabeled pixel belongs to, either by finding its nearest neighbor or by 
iterating the mean shift until it finds a neighboring track of mean-shift values. Describe 
the data structures you use to make this efficient. 

6. (Optional) Mean shift divides the kernel density function estimate by the local weight- 
ing to obtain a step size that is guaranteed to converge but may be slow. Use an alter- 
native step size estimation algorithm from the optimization literature to see if you can 
make the algorithm converge faster. 
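
As a starting point for this exercise, steps 2-4 can be sketched as follows. This is a deliberately naive version, not a reference solution: it uses a flat (unit ball) window of radius `h` on generic feature vectors, scans all points at every iteration instead of using a spatial data structure, and merges modes greedily rather than by true connected components. The function names, the bandwidth, and the sample data are assumptions, and the different scaling of color and position is left out.

```python
def dist2(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_shift(points, h, tol=1e-6, max_iter=100):
    """Flat-kernel mean shift; returns a cluster label per point and the cluster centers."""
    modes = []
    for p in points:
        y = list(p)
        for _ in range(max_iter):
            nbrs = [q for q in points if dist2(q, y) <= h * h]   # window around y
            new = [sum(q[k] for q in nbrs) / len(nbrs) for k in range(len(y))]
            moved = dist2(new, y) ** 0.5
            y = new
            if moved < tol:
                break
        modes.append(y)
    labels, centers = [], []
    for y in modes:                       # greedily merge modes closer than h / 2
        for k, c in enumerate(centers):
            if dist2(y, c) <= (h / 2) ** 2:
                labels.append(k)
                break
        else:
            centers.append(y)
            labels.append(len(centers) - 1)
    return labels, centers


# Hypothetical one-dimensional features (e.g., gray levels) forming two groups.
points = [(1.0,), (1.2,), (0.8,), (5.0,), (5.3,), (4.9,)]
labels, centers = mean_shift(points, h=1.0)
print(labels, centers)
```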
Figure 6.1 Geometric alignment and calibration: (a) geometric alignment of 2D images for
stitching (Szeliski and Shum 1997) © 1997 ACM; (b) a two-dimensional calibration target
(Zhang 2000) © 2000 IEEE; (c) calibration from vanishing points; (d) scene with easy-to-
find lines and vanishing directions (Criminisi, Reid, and Zisserman 2000) © 2000 Springer.

Figure 6.2 Basic set of 2D planar transformations


Once we have extracted features from images, the next stage in many vision algorithms is 
to match these features across different images (Section 4.1.3). An important component of 
this matching is to verify whether the set of matching features is geometrically consistent, 
e.g., whether the feature displacements can be described by a simple 2D or 3D geometric 
transformation. The computed motions can then be used in other applications such as image 
stitching (Chapter 9) or augmented reality (Section 6.2.3). 

In this chapter, we look at the topic of geometric image registration, i.e., the computation 
of 2D and 3D transformations that map features in one image to another (Section 6.1). One 
special case of this problem is pose estimation, which is determining a camera’s position 
relative to a known 3D object or scene (Section 6.2). Another case is the computation of a 
camera’s intrinsic calibration, which consists of the internal parameters such as focal length 
and radial distortion (Section 6.3). In Chapter 7, we look at the related problems of how 
to estimate 3D point structure from 2D matches (triangulation) and how to simultaneously 
estimate 3D geometry and camera motion (structure from motion). 


6.1 2D and 3D feature-based alignment 

Feature-based alignment is the problem of estimating the motion between two or more sets 
of matched 2D or 3D points. In this section, we restrict ourselves to global parametric trans- 
formations, such as those described in Section 2.1.2 and shown in Table 2.1 and Figure 6.2, 
or higher order transformation for curved surfaces (Shashua and Toelg 1997; Can, Stewart, 
Roysam et al. 2002). Applications to non-rigid or elastic deformations (Bookstein 1989; 
Szeliski and Lavallee 1996; Torresani, Hertzmann, and Bregler 2008) are examined in Sec- 
tions 8.3 and 12.6.4. 
Table 6.1 Jacobians of the 2D coordinate transformations $x' = f(x; p)$ shown in Table 2.1,
where we have re-parameterized the motions so that they are identity for $p = 0$.


6.1.1 2D alignment using least squares 

Given a set of matched feature points $\{(x_i, x'_i)\}$ and a planar parametric transformation 1 of
the form

$$x' = f(x; p), \tag{6.1}$$

how can we produce the best estimate of the motion parameters $p$? The usual way to do this
is to use least squares, i.e., to minimize the sum of squared residuals

$$E_{LS} = \sum_i \|r_i\|^2 = \sum_i \|f(x_i; p) - x'_i\|^2, \tag{6.2}$$

where

$$r_i = x'_i - f(x_i; p) = \hat{x}'_i - \tilde{x}'_i \tag{6.3}$$

is the residual between the measured location $\hat{x}'_i$ and its corresponding current predicted
location $\tilde{x}'_i = f(x_i; p)$. (See Appendix A.2 for more on least squares and Appendix B.2 for
a statistical justification.)

1 For examples of non-planar parametric models, such as quadrics, see the work of Shashua and Toelg (1997);
Shashua and Wexler (2001).
Many of the motion models presented in Section 2.1.2 and Table 2.1, i.e., translation,
similarity, and affine, have a linear relationship between the amount of motion $\Delta x = x' - x$
and the unknown parameters $p$,

$$\Delta x = x' - x = J(x)\,p, \tag{6.4}$$

where $J = \partial f / \partial p$ is the Jacobian of the transformation $f$ with respect to the motion
parameters $p$ (see Table 6.1). In this case, a simple linear regression (linear least squares problem)
can be formulated as

$$\begin{aligned}
E_{LLS} &= \sum_i \|J(x_i)\,p - \Delta x_i\|^2 && (6.5) \\
&= p^T \Big[\sum_i J^T(x_i) J(x_i)\Big] p - 2 p^T \Big[\sum_i J^T(x_i) \Delta x_i\Big] + \sum_i \|\Delta x_i\|^2 && (6.6) \\
&= p^T A p - 2 p^T b + c. && (6.7)
\end{aligned}$$


The minimum can be found by solving the symmetric positive definite (SPD) system of normal
equations 2

$$A p = b, \tag{6.8}$$

where

$$A = \sum_i J^T(x_i) J(x_i) \tag{6.9}$$

is called the Hessian and $b = \sum_i J^T(x_i) \Delta x_i$. For the case of pure translation, the resulting
equations have a particularly simple form, i.e., the translation is the average translation 
between corresponding points or, equivalently, the translation of the point centroids. 
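
The normal equations (6.8-6.9) can be assembled directly from per-point Jacobians. The sketch below does this for the translation and affine models and solves the small system with plain Gaussian elimination to stay dependency-free; in practice a linear algebra library would be used. The function names, the affine parameter ordering $(t_x, t_y, a_{00}, a_{01}, a_{10}, a_{11})$, and the test points are assumptions of this example.

```python
def solve(A, b):
    """Solve A p = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[pivot] = M[pivot], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    p = [0.0] * n
    for k in reversed(range(n)):
        p[k] = (M[k][n] - sum(M[k][c] * p[c] for c in range(k + 1, n))) / M[k][k]
    return p


def fit_motion(points, matches, jacobian):
    """Accumulate A = sum J^T J and b = sum J^T dx, then solve A p = b (6.8)."""
    m = len(jacobian(points[0])[0])
    A = [[0.0] * m for _ in range(m)]
    b = [0.0] * m
    for x, x_prime in zip(points, matches):
        J = jacobian(x)
        dx = [x_prime[0] - x[0], x_prime[1] - x[1]]
        for r in range(2):
            for a in range(m):
                b[a] += J[r][a] * dx[r]
                for c in range(m):
                    A[a][c] += J[r][a] * J[r][c]
    return solve(A, b)


def translation_jacobian(x):
    return [[1.0, 0.0], [0.0, 1.0]]


def affine_jacobian(x):
    # Parameter order (tx, ty, a00, a01, a10, a11) is a choice made for this sketch.
    return [[1.0, 0.0, x[0], x[1], 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0, x[0], x[1]]]


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 3.0), (-1.0, 2.0)]
p_true = [1.0, -2.0, 0.1, 0.2, -0.1, 0.05]
matches = [tuple(x[r] + sum(J * p for J, p in zip(affine_jacobian(x)[r], p_true))
                 for r in range(2)) for x in points]

p_affine = fit_motion(points, matches, affine_jacobian)       # recovers p_true
t_only = fit_motion(points, matches, translation_jacobian)    # average displacement
print(p_affine, t_only)
```

With the translation Jacobian, $A$ is a multiple of the identity and the solution reduces to the mean displacement, in agreement with the remark about point centroids.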


Uncertainty weighting. The above least squares formulation assumes that all feature points
are matched with the same accuracy. This is often not the case, since certain points may fall
into more textured regions than others. If we associate a scalar variance estimate $\sigma_i^2$ with
each correspondence, we can minimize the weighted least squares problem instead, 3

$$E_{WLS} = \sum_i \sigma_i^{-2} \|r_i\|^2. \tag{6.10}$$

As shown in Section 8.1.3, a covariance estimate for patch-based matching can be obtained
by multiplying the inverse of the patch Hessian $A_i$ (8.55) with the per-pixel noise covariance
$\sigma_n^2$ (8.44). Weighting each squared residual by its inverse covariance $\Sigma_i^{-1} = \sigma_n^{-2} A_i$ (which
is called the information matrix), we obtain

$$E_{CWLS} = \sum_i \|r_i\|^2_{\Sigma_i^{-1}} = \sum_i r_i^T \Sigma_i^{-1} r_i = \sum_i \sigma_n^{-2}\, r_i^T A_i r_i. \tag{6.11}$$

2 For poorly conditioned problems, it is better to use QR decomposition on the set of linear equations $J(x_i)p = \Delta x_i$
instead of the normal equations (Bjorck 1996; Golub and Van Loan 1996). However, such conditions rarely
arise in image registration.

3 Problems where each measurement can have a different variance or certainty are called heteroscedastic models.

Figure 6.3 A simple panograph consisting of three images automatically aligned with a
translational model and then averaged together.
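
For a pure translation, weighting each residual by a 2 × 2 information matrix $S_i$ leads to the small linear system $(\sum_i S_i)\, t = \sum_i S_i\, \Delta x_i$. The sketch below solves it in closed form. The function name and the two hypothetical matches (one well localized only in $x$, the other only in $y$) are assumptions made for illustration.

```python
def fuse_translations(deltas, infos):
    """Minimize sum_i (t - d_i)^T S_i (t - d_i) over a 2D translation t."""
    a = b = c = d = u = v = 0.0
    for (dx, dy), ((s00, s01), (s10, s11)) in zip(deltas, infos):
        a += s00
        b += s01
        c += s10
        d += s11
        u += s00 * dx + s01 * dy     # accumulate sum_i S_i d_i
        v += s10 * dx + s11 * dy
    det = a * d - b * c              # invert the 2x2 matrix sum_i S_i
    return ((d * u - b * v) / det, (a * v - c * u) / det)


# Hypothetical matches: the first is reliable in x only, the second in y only.
deltas = [(1.0, 5.0), (3.0, 2.0)]
infos = [((100.0, 0.0), (0.0, 1.0)),
         ((1.0, 0.0), (0.0, 100.0))]
t = fuse_translations(deltas, infos)
print(t)
```

The estimate stays close to the first match in $x$ and to the second match in $y$, instead of landing at the unweighted average of the two displacements.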


6.1.2 Application: Panography 

One of the simplest (and most fun) applications of image alignment is a special form of image 
stitching called panography. In a panograph, images are translated and optionally rotated and 
scaled before being blended with simple averaging (Figure 6.3). This process mimics the 
photographic collages created by artist David Hockney, although his compositions use an 
opaque overlay model, being created out of regular photographs. 

In most of the examples seen on the Web, the images are aligned by hand for best artistic 
effect. 4 However, it is also possible to use feature matching and alignment techniques to 
perform the registration automatically (Nomura, Zhang, and Nayar 2007; Zelnik-Manor and 
Perona 2007). 

4 http://www.flickr.com/groups/panography/.

Consider a simple translational model. We want all the corresponding features in different
images to line up as well as possible. Let $t_j$ be the location of the $j$th image coordinate frame
in the global composite frame and $x_{ij}$ be the location of the $i$th matched feature in the $j$th
image. In order to align the images, we wish to minimize the least squares error

$$E_{PLS} = \sum_{ij} \|(t_j + x_{ij}) - x_i\|^2, \tag{6.12}$$

where $x_i$ is the consensus (average) position of feature $i$ in the global coordinate frame.
(An alternative approach is to register each pair of overlapping images separately and then 
compute a consensus location for each frame — see Exercise 6.2.) 

The above least squares problem is indeterminate (you can add a constant offset to all the 
frame and point locations t 3 and x,). To fix this, either pick one frame as being at the origin 
or add a constraint to make the average frame offsets be 0. 
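
The translational alignment described above is a linear least squares problem in the frame offsets and the consensus feature positions. The following is a minimal sketch in Python with NumPy, not code from the text: the function name `align_translations` and the `(i, j, x_ij)` input format are assumptions made for illustration, and the gauge freedom is removed by pinning frame 0 to the origin.

```python
import numpy as np

def align_translations(observations, num_frames, num_features):
    """Minimize sum ||(t_j + x_ij) - x_i||^2 with frame 0 fixed at the origin.

    observations: list of (i, j, x_ij), where x_ij is the 2D position of
    feature i measured in image j (hypothetical input format).
    Returns (t, x): frame offsets (num_frames, 2) and consensus feature
    positions (num_features, 2).
    """
    n_unknowns = (num_frames - 1) + num_features   # t_0 is pinned to (0, 0)
    A = np.zeros((len(observations), n_unknowns))
    b = np.zeros((len(observations), 2))
    for row, (i, j, x_ij) in enumerate(observations):
        if j > 0:
            A[row, j - 1] = 1.0                    # coefficient of t_j
        A[row, (num_frames - 1) + i] = -1.0        # coefficient of x_i
        b[row] = -np.asarray(x_ij, dtype=float)    # t_j - x_i = -x_ij
    # The x and y coordinates share the same matrix, so solve them together.
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    t = np.vstack([np.zeros((1, 2)), sol[:num_frames - 1]])
    x = sol[num_frames - 1:]
    return t, x
```

The solution is unique only if the images are connected through shared features; otherwise `lstsq` returns a minimum-norm solution for the disconnected parts.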

The formulas for adding rotation and scale transformations are straightforward and are 
left as an exercise (Exercise 6.2). See if you can create some collages that you would be 
happy to share with others on the Web. 


6.1.3 Iterative algorithms 

While linear least squares is the simplest method for estimating parameters, most problems in 
computer vision do not have a simple linear relationship between the measurements and the 
unknowns. In this case, the resulting problem is called non-linear least squares or non-linear 
regression. 

Consider, for example, the problem of estimating a rigid Euclidean 2D transformation 
(translation plus rotation) between two sets of points. If we parameterize this transformation 
by the translation amount $(t_x, t_y)$ and the rotation angle $\theta$, as in Table 2.1, the Jacobian of 
this transformation, given in Table 6.1, depends on the current value of $\theta$. Notice how in 
Table 6.1, we have re-parameterized the motion matrices so that they are always the identity 
at the origin $p = 0$, which makes it easier to initialize the motion parameters. 

To minimize the non-linear least squares problem, we iteratively find an update $\Delta p$ to the 
current parameter estimate $p$ by minimizing 

$$\begin{aligned}
E_{\mathrm{NLS}}(\Delta p) &= \sum_i \|f(x_i; p + \Delta p) - x'_i\|^2 && (6.13) \\
&\approx \sum_i \|J(x_i; p)\,\Delta p - r_i\|^2 && (6.14) \\
&= \Delta p^T \Big[\sum_i J^T J\Big] \Delta p - 2\,\Delta p^T \Big[\sum_i J^T r_i\Big] + \sum_i \|r_i\|^2 && (6.15) \\
&= \Delta p^T A\,\Delta p - 2\,\Delta p^T b + c, && (6.16)
\end{aligned}$$

where $r_i = x'_i - f(x_i; p)$ is the current residual, the “Hessian” 5 $A$ is the same as 
Equation (6.9), and the right hand side vector 

$$b = \sum_i J^T(x_i)\, r_i \qquad (6.17)$$

is now a Jacobian-weighted sum of residual vectors. This makes intuitive sense, as the 
parameters are pulled in the direction of the prediction error with a strength proportional to the 
Jacobian. 

5 The “Hessian” $A$ is not the true Hessian (second derivative) of the non-linear least squares problem (6.13). 
Instead, it is the approximate Hessian, which neglects second (and higher) order derivatives of $f(x_i; p + \Delta p)$. 

Once $A$ and $b$ have been computed, we solve for $\Delta p$ using 

$$(A + \lambda\,\mathrm{diag}(A))\,\Delta p = b, \qquad (6.18)$$

and update the parameter vector $p \leftarrow p + \Delta p$ accordingly. The parameter $\lambda$ is an 
additional damping parameter used to ensure that the system takes a “downhill” step in energy 
(squared error) and is an essential component of the Levenberg-Marquardt algorithm 
(described in more detail in Appendix A.3). In many applications, it can be set to 0 if the system 
is successfully converging. 

For the case of our 2D translation+rotation, we end up with a 3 × 3 set of normal equations 
in the unknowns $(\delta t_x, \delta t_y, \delta\theta)$. An initial guess for $(t_x, t_y, \theta)$ can be obtained by fitting a 
four-parameter similarity transform in $(t_x, t_y, c, s)$ and then setting $\theta = \tan^{-1}(s/c)$. An 
alternative approach is to estimate the translation parameters using the centroids of the 2D 
points and to then estimate the rotation angle using polar coordinates (Exercise 6.3). 
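
The iteration in (6.13)-(6.18) can be sketched for the 2D translation+rotation case as follows. This is an illustrative Python/NumPy sketch under stated assumptions, not the book's implementation: the function name `fit_rigid_2d` is hypothetical, the damping `lam` is held fixed (a full Levenberg-Marquardt implementation would adapt it and reject uphill steps), and the Jacobian is written out directly for the parameterization $p = (t_x, t_y, \theta)$.

```python
import numpy as np

def fit_rigid_2d(x, x_prime, p0=(0.0, 0.0, 0.0), lam=0.0, iters=50):
    """Fit x' ~ R(theta) x + t by damped Gauss-Newton; returns (tx, ty, theta)."""
    p = np.array(p0, dtype=float)
    for _ in range(iters):
        tx, ty, th = p
        c, s = np.cos(th), np.sin(th)
        pred = x @ np.array([[c, -s], [s, c]]).T + np.array([tx, ty])
        r = x_prime - pred                        # residuals r_i = x'_i - f(x_i; p)
        A = np.zeros((3, 3))
        b = np.zeros(3)
        for (xi, yi), ri in zip(x, r):
            # Jacobian of f with respect to (tx, ty, theta) at the current p
            J = np.array([[1.0, 0.0, -s * xi - c * yi],
                          [0.0, 1.0,  c * xi - s * yi]])
            A += J.T @ J                          # approximate Hessian
            b += J.T @ ri                         # Jacobian-weighted residuals
        dp = np.linalg.solve(A + lam * np.diag(np.diag(A)), b)   # (6.18)
        p += dp
        if np.linalg.norm(dp) < 1e-12:
            break
    return p
```

With `lam=0` this is plain Gauss-Newton, which is usually adequate when the initial guess is reasonable and the correspondences are free of outliers.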

For the other 2D motion models, the derivatives in Table 6.1 are all fairly straightforward, 
except for the projective 2D motion (homography), which arises in image-stitching 
applications (Chapter 9). These equations can be re-written from (2.21) in their new parametric form 
as 

$$x' = \frac{(1 + h_{00})x + h_{01}y + h_{02}}{h_{20}x + h_{21}y + 1} \quad\text{and}\quad y' = \frac{h_{10}x + (1 + h_{11})y + h_{12}}{h_{20}x + h_{21}y + 1}. \qquad (6.19)$$

The Jacobian is therefore 

$$J = \frac{\partial f}{\partial p} = \frac{1}{D}\begin{bmatrix} x & y & 1 & 0 & 0 & 0 & -x'x & -x'y \\ 0 & 0 & 0 & x & y & 1 & -y'x & -y'y \end{bmatrix}, \qquad (6.20)$$

where $D = h_{20}x + h_{21}y + 1$ is the denominator in (6.19), which depends on the current 
parameter settings (as do $x'$ and $y'$). 

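An initial guess for the eight unknowns $\{h_{00}, h_{01}, \ldots, h_{21}\}$ can be obtained by 
multiplying both sides of the equations in (6.19) through by the denominator, which yields the linear 
set of equations, 

$$\begin{bmatrix} \hat{x}' - x \\ \hat{y}' - y \end{bmatrix} = \begin{bmatrix} x & y & 1 & 0 & 0 & 0 & -\hat{x}'x & -\hat{x}'y \\ 0 & 0 & 0 & x & y & 1 & -\hat{y}'x & -\hat{y}'y \end{bmatrix}\begin{bmatrix} h_{00} \\ \vdots \\ h_{21} \end{bmatrix}, \qquad (6.21)$$

where $(\hat{x}', \hat{y}')$ are the sensed feature locations. 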


However, this is not optimal from a statistical point of view, since the denominator D, which 
was used to multiply each equation, can vary quite a bit from point to point. 6 


6 Hartley and Zisserman (2004) call this strategy of forming linear equations from rational equations the direct 

One way to compensate for this is to reweight each equation by the inverse of the current 
estimate of the denominator, $D$, 

$$\frac{1}{D}\begin{bmatrix} \hat{x}' - x \\ \hat{y}' - y \end{bmatrix} = \frac{1}{D}\begin{bmatrix} x & y & 1 & 0 & 0 & 0 & -\hat{x}'x & -\hat{x}'y \\ 0 & 0 & 0 & x & y & 1 & -\hat{y}'x & -\hat{y}'y \end{bmatrix}\begin{bmatrix} h_{00} \\ \vdots \\ h_{21} \end{bmatrix}. \qquad (6.22)$$


While this may at first seem to be the exact same set of equations as (6.21), because least 
squares is being used to solve the over-determined set of equations, the weightings do matter 
and produce a different set of normal equations that performs better in practice. 

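The most principled way to do the estimation, however, is to directly minimize the squared 
residual equations (6.13) using the Gauss-Newton approximation, i.e., performing a 
first-order Taylor series expansion in $p$, as shown in (6.14), which yields the set of equations 

$$\begin{bmatrix} \hat{x}' - \tilde{x}' \\ \hat{y}' - \tilde{y}' \end{bmatrix} = \frac{1}{D}\begin{bmatrix} x & y & 1 & 0 & 0 & 0 & -\tilde{x}'x & -\tilde{x}'y \\ 0 & 0 & 0 & x & y & 1 & -\tilde{y}'x & -\tilde{y}'y \end{bmatrix}\begin{bmatrix} \Delta h_{00} \\ \vdots \\ \Delta h_{21} \end{bmatrix}. \qquad (6.23)$$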


While these look similar to (6.22), they differ in two important respects. First, the left hand 
side consists of unweighted prediction errors rather than point displacements and the solution 
vector is a perturbation to the parameter vector $p$. Second, the quantities inside $J$ involve 
predicted feature locations $(\tilde{x}', \tilde{y}')$ instead of sensed feature locations $(\hat{x}', \hat{y}')$. Both of these 
differences are subtle and yet they lead to an algorithm that, when combined with proper 
checking for downhill steps (as in the Levenberg-Marquardt algorithm), will converge to a 
local minimum. Note that iterating Equations (6.22) is not guaranteed to converge, since it is 
not minimizing a well-defined energy function. 
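
The difference between the linear initial guess (6.21) and the Gauss-Newton update (6.23) can be made concrete with a short sketch. The code below is an illustration in Python/NumPy under stated assumptions, not the book's implementation: the helper names `apply_h`, `rows`, and `fit_homography` are hypothetical, the parameter vector is ordered $(h_{00}, h_{01}, h_{02}, h_{10}, h_{11}, h_{12}, h_{20}, h_{21})$, and no damping or downhill check is included.

```python
import numpy as np

def apply_h(h, pts):
    """Map (N, 2) points through the homography (6.19); also return D."""
    x, y = pts[:, 0], pts[:, 1]
    D = h[6] * x + h[7] * y + 1.0
    xp = ((1.0 + h[0]) * x + h[1] * y + h[2]) / D
    yp = (h[3] * x + (1.0 + h[4]) * y + h[5]) / D
    return np.column_stack([xp, yp]), D

def rows(pts, xp, yp):
    """Stack the 2N x 8 matrix shared by (6.20)-(6.23), without the 1/D factor."""
    x, y = pts[:, 0], pts[:, 1]
    z, o = np.zeros_like(x), np.ones_like(x)
    top = np.column_stack([x, y, o, z, z, z, -xp * x, -xp * y])
    bot = np.column_stack([z, z, z, x, y, o, -yp * x, -yp * y])
    return np.vstack([top, bot])

def fit_homography(pts, sensed, iters=10):
    # Linear initial guess (6.21): sensed locations appear inside the matrix.
    M = rows(pts, sensed[:, 0], sensed[:, 1])
    rhs = np.concatenate([sensed[:, 0] - pts[:, 0], sensed[:, 1] - pts[:, 1]])
    h, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    # Gauss-Newton refinement (6.23): predicted locations and the 1/D factor.
    for _ in range(iters):
        pred, D = apply_h(h, pts)
        J = rows(pts, pred[:, 0], pred[:, 1]) / np.concatenate([D, D])[:, None]
        r = np.concatenate([sensed[:, 0] - pred[:, 0], sensed[:, 1] - pred[:, 1]])
        dh, *_ = np.linalg.lstsq(J, r, rcond=None)   # solves the normal equations
        h = h + dh
    return h
```

On noise-free data the linear guess is already exact; the two stages differ only when the sensed locations are noisy.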

Equation (6.23) is analogous to the additive algorithm for direct intensity-based 
registration (Section 8.2), since the change to the full transformation is being computed. If we 
prepend an incremental homography to the current homography instead, i.e., we use a 
compositional algorithm (described in Section 8.2), we get $D = 1$ (since $p = 0$) and the above 
formula simplifies to 

$$\begin{bmatrix} \hat{x}' - x \\ \hat{y}' - y \end{bmatrix} = \begin{bmatrix} x & y & 1 & 0 & 0 & 0 & -x^2 & -xy \\ 0 & 0 & 0 & x & y & 1 & -xy & -y^2 \end{bmatrix}\begin{bmatrix} \Delta h_{00} \\ \vdots \\ \Delta h_{21} \end{bmatrix}, \qquad (6.24)$$

where we have replaced $(\tilde{x}', \tilde{y}')$ with $(x, y)$ for conciseness. (Notice how this results in the 
same Jacobian as (8.63).) 


linear transform, but that term is more commonly associated with pose estimation (Section 6.2). Note also that our 
definition of the $h_{ij}$ parameters differs from that used in their book, since we define $h_{ii}$ to be the difference from 
unity and we do not leave $h_{22}$ as a free parameter, which means that we cannot handle certain extreme homographies. 

6.1.4 Robust least squares and RANSAC 


While regular least squares is the method of choice for measurements where the noise follows 
a normal (Gaussian) distribution, more robust versions of least squares are required when 
there are outliers among the correspondences (as there almost always are). In this case, it is 
preferable to use an M-estimator (Huber 1981; Hampel, Ronchetti, Rousseeuw et al. 1986; 
Black and Rangarajan 1996; Stewart 1999), which involves applying a robust penalty function 
$\rho(r)$ to the residuals 

$$E_{\mathrm{RLS}}(\Delta p) = \sum_i \rho(\|r_i\|) \qquad (6.25)$$

instead of squaring them. 

We can take the derivative of this function with respect to $p$ and set it to 0, 

$$\sum_i \psi(\|r_i\|)\frac{\partial \|r_i\|}{\partial p} = \sum_i \frac{\psi(\|r_i\|)}{\|r_i\|}\, r_i^T \frac{\partial r_i}{\partial p} = 0, \qquad (6.26)$$

where $\psi(r) = \rho'(r)$ is the derivative of $\rho$ and is called the influence function. If we introduce 
a weight function, $w(r) = \psi(r)/r$, we observe that finding the stationary point of (6.25) using 
(6.26) is equivalent to minimizing the iteratively reweighted least squares (IRLS) problem 

$$E_{\mathrm{IRLS}} = \sum_i w(\|r_i\|)\,\|r_i\|^2, \qquad (6.27)$$

where the $w(\|r_i\|)$ play the same local weighting role as $\sigma_i^{-2}$ in (6.10). The IRLS 
algorithm alternates between computing the weights $w(\|r_i\|)$ and solving the 
resulting weighted least squares problem (with fixed $w$ values). Other incremental robust least 
squares algorithms can be found in the work of Sawhney and Ayer (1996); Black and 
Anandan (1996); Black and Rangarajan (1996); Baker, Gross, Ishikawa et al. (2003) and textbooks 
and tutorials on robust statistics (Huber 1981; Hampel, Ronchetti, Rousseeuw et al. 1986; 
Rousseeuw and Leroy 1987; Stewart 1999). 
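
The alternation between reweighting and solving can be sketched for the simplest motion model, a pure 2D translation. The code below is an illustrative Python/NumPy sketch, not the book's implementation: the choice of the Huber penalty, the threshold `k`, the median starting point, and the function names are all assumptions made for this example.

```python
import numpy as np

def huber_weight(r, k=1.0):
    """w(r) = psi(r) / r for the Huber penalty: 1 for r <= k, k / r beyond."""
    r = np.maximum(r, 1e-12)                    # avoid division by zero
    return np.where(r <= k, 1.0, k / r)

def irls_translation(x, x_prime, k=1.0, iters=50):
    """Robustly estimate t in x' ~ x + t by iteratively reweighted least squares."""
    d = x_prime - x                             # per-match displacement
    t = np.median(d, axis=0)                    # robust starting point (assumption)
    for _ in range(iters):
        r = np.linalg.norm(d - t, axis=1)       # residual magnitudes ||r_i||
        w = huber_weight(r, k)                  # weights held fixed for this pass
        t = (w[:, None] * d).sum(axis=0) / w.sum()   # weighted least squares solve
    return t
```

Because the Huber influence function is bounded rather than redescending, gross outliers still exert a small constant pull on the estimate, which is one reason the text goes on to recommend finding a set of inliers first.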

While M-estimators can definitely help reduce the influence of outliers, in some cases, 
starting with too many outliers will prevent IRLS (or other gradient descent algorithms) from 
converging to the global optimum. A better approach is often to find a starting set of inlier 
correspondences, i.e., points that are consistent with a dominant motion estimate. 7 

7 For pixel-based alignment methods (Section 8.1.1), hierarchical (coarse-to-fine) techniques are often used to 
lock onto the dominant motion in a scene. 

Two widely used approaches to this problem are called RANdom SAmple Consensus, or 
RANSAC for short (Fischler and Bolles 1981), and least median of squares (LMS) (Rousseeuw 
1984). Both techniques start by selecting (at random) a subset of $k$ correspondences, which is 
then used to compute an initial estimate for $p$. The residuals of the full set of correspondences 
are then computed as 

$$r_i = \tilde{x}'_i(x_i; p) - \hat{x}'_i, \qquad (6.28)$$

where $\tilde{x}'_i$ are the estimated (mapped) locations and $\hat{x}'_i$ are the sensed (detected) feature point 
locations. 

The RANSAC technique then counts the number of inliers that are within $\epsilon$ of their 
predicted location, i.e., whose $\|r_i\| \le \epsilon$. (The $\epsilon$ value is application dependent but is often 
around 1-3 pixels.) Least median of squares finds the median value of the $\|r_i\|^2$ values. The 
random selection process is repeated $S$ times and the sample set with the largest number of 
inliers (or with the smallest median residual) is kept as the final solution. Either the initial 
parameter guess $p$ or the full set of computed inliers is then passed on to the next data fitting 
stage. 
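
The sample, score, and keep-the-best loop described above can be sketched as follows for a translational model, where a single correspondence ($k = 1$) is enough to form a hypothesis. This is a minimal Python/NumPy illustration rather than a reference implementation: the function name, the default threshold, the fixed trial count, and the final mean-based refit on the inliers are assumptions.

```python
import numpy as np

def ransac_translation(x, x_prime, eps=2.0, trials=100, seed=0):
    """RANSAC for x' ~ x + t; returns the refit translation and the inlier mask."""
    rng = np.random.default_rng(seed)
    d = x_prime - x
    best_inliers = np.zeros(len(d), dtype=bool)
    for _ in range(trials):
        sample = rng.integers(len(d))            # pick k = 1 correspondence
        t = d[sample]                            # hypothesis from the sample
        r = np.linalg.norm(d - t, axis=1)        # residuals of all matches (6.28)
        inliers = r <= eps                       # matches within eps of prediction
        if inliers.sum() > best_inliers.sum():   # keep the largest consensus set
            best_inliers = inliers
    t_refit = d[best_inliers].mean(axis=0)       # least squares fit on the inliers
    return t_refit, best_inliers
```

Replacing the inlier count with the median of the squared residuals (and keeping the hypothesis with the smallest median) would turn the same loop into a least median of squares estimator.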

When the number of measurements is quite large, it may be preferable to only score a 
subset of the measurements in an initial round that selects the most plausible hypotheses for 
additional scoring and selection. This modification of RANSAC, which can significantly 
speed up its performance, is called Preemptive RANSAC (Nistér 2003). In another variant 
on RANSAC called PROSAC (PROgressive SAmple Consensus), random samples are ini- 
tially added from the most “confident” matches, thereby speeding up the process of finding a 
(statistically) likely good set of inliers (Chum and Matas 2005). 

To ensure that the random sampling has a good chance of finding a true set of inliers, a 
sufficient number of trials $S$ must be run. Let $p$ be the probability that any given 
correspondence is valid and $P$ be the total probability of success after $S$ trials. The likelihood in one 
trial that all $k$ random samples are inliers is $p^k$. Therefore, the likelihood that $S$ such trials 
will all fail is 

$$1 - P = (1 - p^k)^S \qquad (6.29)$$

and the required minimum number of trials is 

$$S = \frac{\log(1 - P)}{\log(1 - p^k)}. \qquad (6.30)$$


Stewart (1999) gives examples of the required number of trials S to attain a 99% proba- 
bility of success. As you can see from Table 6.2, the number of trials grows quickly with the 
number of sample points used. This provides a strong incentive to use the minimum number 
of sample points k possible for any given trial, which is how RANSAC is normally used in 
practice. 
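
Equation (6.30) is easy to evaluate directly. The short Python sketch below (the function name `ransac_trials` is an assumption) rounds the result up to the next integer and reproduces the entries of Table 6.2.

```python
import math

def ransac_trials(p, k, P=0.99):
    """Smallest integer S with 1 - (1 - p**k)**S >= P, following (6.30)."""
    return math.ceil(math.log(1.0 - P) / math.log(1.0 - p ** k))

print(ransac_trials(0.5, 3))   # 35
print(ransac_trials(0.6, 6))   # 97
print(ransac_trials(0.5, 6))   # 293
```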


k    p     S 

3    0.5   35 
6    0.6   97 
6    0.5   293 

Table 6.2 Number of trials S to attain a 99% probability of success (Stewart 1999). 


Uncertainty modeling 

In addition to robustly computing a good alignment, some applications require the 
computation of uncertainty (see Appendix B.6). For linear problems, this estimate can be obtained 
by inverting the Hessian matrix (6.9) and multiplying it by the feature position noise (if these 
have not already been used to weight the individual measurements, as in Equations (6.10) 
and (6.11)). In statistics, the Hessian, which is the inverse covariance, is sometimes called the 
(Fisher) information matrix (Appendix B.1.1). 

When the problem involves non-linear least squares, the inverse of the Hessian matrix 
provides the Cramer-Rao lower bound on the covariance matrix, i.e., it provides the minimum 
amount of covariance in a given solution, which can actually have a wider spread (“longer 
tails”) if the energy flattens out away from the local minimum where the optimal solution is 
found. 


6.1.5 3D alignment 

Instead of aligning 2D sets of image features, many computer vision applications require the 
alignment of 3D points. In the case where the 3D transformations are linear in the motion 
parameters, e.g., for translation, similarity, and affine, regular least squares (6.5) can be used. 

The case of rigid (Euclidean) motion, 

$$E_{\mathrm{R3D}} = \sum_i \|x'_i - R x_i - t\|^2, \qquad (6.31)$$

which arises more frequently and is often called the absolute orientation problem (Horn 
1987), requires slightly different techniques. If only scalar weightings are being used (as 
opposed to full 3D per-point anisotropic covariance estimates), the weighted centroids of the 
two point clouds $c$ and $c'$ can be used to estimate the translation $t = c' - Rc$. 8 We are then 
left with the problem of estimating the rotation between two sets of points $\{\hat{x}_i = x_i - c\}$ 
and $\{\hat{x}'_i = x'_i - c'\}$ that are both centered at the origin. 

One commonly used technique is called the orthogonal Procrustes algorithm (Golub and 
Van Loan 1996, p. 601) and involves computing the singular value decomposition (SVD) of 
the 3 × 3 correlation matrix 

$$C = \sum_i \hat{x}'_i \hat{x}_i^T = U \Sigma V^T. \qquad (6.32)$$

The rotation matrix is then obtained as $R = UV^T$. (Verify this for yourself when $\hat{x}' = R\hat{x}$.) 

8 When full covariances are used, they are transformed by the rotation and so a closed-form solution for 
translation is not possible. 

Another technique is the absolute orientation algorithm (Horn 1987) for estimating the 
unit quaternion corresponding to the rotation matrix R, which involves forming a 4 x 4 matrix 
from the entries in C and then finding the eigenvector associated with its largest positive 
eigenvalue. 

Lorusso, Eggert, and Fisher (1995) experimentally compare these two techniques to two 
additional techniques proposed in the literature, but find that the difference in accuracy is 
negligible (well below the effects of measurement noise). 
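
The SVD-based (orthogonal Procrustes) solution described above can be sketched in a few lines. The code is an illustrative Python/NumPy sketch with unit weights, not the book's implementation; the function name is hypothetical, and the determinant check that guards against returning a reflection is an addition that the text does not discuss.

```python
import numpy as np

def absolute_orientation(x, x_prime):
    """Estimate R, t with x' ~ R x + t from (N, 3) corresponding points."""
    c, c_prime = x.mean(axis=0), x_prime.mean(axis=0)    # centroids
    xh, xph = x - c, x_prime - c_prime                   # centered point sets
    C = xph.T @ xh                                       # sum_i x'_i x_i^T (6.32)
    U, _, Vt = np.linalg.svd(C)
    # Flip the last singular direction if U V^T would be a reflection
    # (det = -1); this safeguard is an addition for noisy or degenerate data.
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ S @ Vt
    t = c_prime - R @ c                                  # t = c' - R c
    return R, t
```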

In situations where these closed-form algorithms are not applicable, e.g., when full 3D 
covariances are being used or when the 3D alignment is part of some larger optimization, the 
incremental rotation update introduced in Section 2.1.4 (2.35-2.36), which is parameterized 
by an instantaneous rotation vector $\omega$, can be used. (See Section 9.1.3 for an application to 
image stitching.) 

In some situations, e.g., when merging range data maps, the correspondence between 
data points is not known a priori. In this case, iterative algorithms that start by matching 
nearby points and then update the most likely correspondence can be used (Besl and McKay 
1992; Zhang 1994; Szeliski and Lavallee 1996; Gold, Rangarajan, Lu et al. 1998; David, 
DeMenthon, Duraiswami et al. 2004; Li and Hartley 2007; Enqvist, Josephson, and Kahl 
2009). These techniques are discussed in more detail in Section 12.2.1. 

6.2 Pose estimation 

A particular instance of feature-based alignment, which occurs very often, is estimating an 
object’s 3D pose from a set of 2D point projections. This pose estimation problem is also 
known as extrinsic calibration, as opposed to the intrinsic calibration of internal camera pa- 
rameters such as focal length, which we discuss in Section 6.3. The problem of recovering 
pose from three correspondences, which is the minimal amount of information necessary, 
is known as the perspective-3-point-problem (P3P), with extensions to larger numbers of 
points collectively known as PnP (Haralick, Lee, Ottenberg et al. 1994; Quan and Lan 1999; 
Moreno-Noguer, Lepetit, and Fua 2007). 

In this section, we look at some of the techniques that have been developed to solve such 
problems, starting with the direct linear transform (DLT), which recovers a 3 x 4 camera ma- 
trix, followed by other “linear” algorithms, and then looking at statistically optimal iterative 
algorithms. 

6.2.1 Linear algorithms 

The simplest way to recover the pose of the camera is to form a set of linear equations 
analogous to those used for 2D motion estimation (6.19) from the camera matrix form of 
perspective projection (2.55-2.56), 

$$x_i = \frac{p_{00}X_i + p_{01}Y_i + p_{02}Z_i + p_{03}}{p_{20}X_i + p_{21}Y_i + p_{22}Z_i + p_{23}} \qquad (6.33)$$

$$y_i = \frac{p_{10}X_i + p_{11}Y_i + p_{12}Z_i + p_{13}}{p_{20}X_i + p_{21}Y_i + p_{22}Z_i + p_{23}}, \qquad (6.34)$$

where $(x_i, y_i)$ are the measured 2D feature locations and $(X_i, Y_i, Z_i)$ are the known 3D 
feature locations (Figure 6.4). As with (6.21), this system of equations can be solved in a 
linear fashion for the unknowns in the camera matrix $P$ by multiplying the denominator on 
both sides of the equation. 9 The resulting algorithm is called the direct linear transform 
(DLT) and is commonly attributed to Sutherland (1974). (For a more in-depth discussion, 
refer to the work of Hartley and Zisserman (2004).) In order to compute the 12 (or 11) 
unknowns in $P$, at least six correspondences between 3D and 2D locations must be known. 

As with the case of estimating homographies (6.21-6.23), more accurate results for the 
entries in P can be obtained by directly minimizing the set of Equations (6.33-6.34) using 
non-linear least squares with a small number of iterations. 
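
The linear step can be sketched by multiplying (6.33-6.34) through by the denominator and stacking two homogeneous equations per correspondence. The code below is an illustrative Python/NumPy sketch, not the book's implementation: the function name is hypothetical, the scale ambiguity is resolved by taking the smallest singular vector, and no coordinate normalization (often advisable for pixel-scale data) is applied.

```python
import numpy as np

def dlt_camera_matrix(X, x):
    """Estimate the 3x4 camera matrix P (up to scale) from N >= 6 matches.

    X: (N, 3) known 3D points; x: (N, 2) measured 2D feature locations.
    """
    N = len(X)
    Xh = np.hstack([X, np.ones((N, 1))])       # homogeneous 3D points
    A = np.zeros((2 * N, 12))
    # x_i (p2 . X) - (p0 . X) = 0   and   y_i (p2 . X) - (p1 . X) = 0,
    # where p0, p1, p2 are the rows of P.
    A[0::2, 0:4] = -Xh
    A[0::2, 8:12] = x[:, [0]] * Xh
    A[1::2, 4:8] = -Xh
    A[1::2, 8:12] = x[:, [1]] * Xh
    _, _, Vt = np.linalg.svd(A)
    P = Vt[-1].reshape(3, 4)                   # smallest singular vector
    return P / np.linalg.norm(P)
```

The 3D points must not all lie on a plane, otherwise the system is rank deficient and the recovered matrix is not unique.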

Once the entries in $P$ have been recovered, it is possible to recover both the intrinsic 
calibration matrix $K$ and the rigid transformation $(R, t)$ by observing from Equation (2.56) 
that 

$$P = K[R \mid t]. \qquad (6.35)$$

Since $K$ is by convention upper-triangular (see the discussion in Section 2.1.5), both $K$ and 
$R$ can be obtained from the front 3 × 3 sub-matrix of $P$ using RQ factorization (Golub and 
Van Loan 1996). 10 
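
NumPy provides a QR but not an RQ routine, so the factorization can be obtained by applying QR to a row-reversed, transposed copy of the matrix. The following Python sketch is one way to do this under stated assumptions: the function name is hypothetical, the sign conventions (positive diagonal for $K$, $\det R = +1$, $K$ scaled so that its last entry is 1) are choices made here rather than requirements stated in the text.

```python
import numpy as np

def decompose_camera(P):
    """Split P ~ K [R | t] into upper-triangular K, rotation R, and t."""
    M = P[:, :3]
    if np.linalg.det(M) < 0:           # P is only known up to scale and sign
        P, M = -P, -M
    flip = np.eye(3)[::-1]             # permutation that reverses row order
    Q_, R_ = np.linalg.qr((flip @ M).T)
    K = flip @ R_.T @ flip             # upper-triangular factor
    R = flip @ Q_.T                    # orthogonal factor, so that M = K R
    D = np.diag(np.sign(np.diag(K)))   # force a positive diagonal on K
    K, R = K @ D, D @ R                # D @ D = I, so the product K R is unchanged
    t = np.linalg.solve(K, P[:, 3])    # last column of P equals K t
    return K / K[2, 2], R, t
```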

In most applications, however, we have some prior knowledge about the intrinsic cali- 
bration matrix K, e.g., that the pixels are square, the skew is very small, and the optical 
center is near the center of the image (2.57-2.59). Such constraints can be incorporated into 
a non-linear minimization of the parameters in K and (R,t), as described in Section 6.2.2. 

In the case where the camera is already calibrated, i.e., the matrix $K$ is known 
(Section 6.3), we can perform pose estimation using as few as three points (Fischler and Bolles 
1981; Haralick, Lee, Ottenberg et al. 1994; Quan and Lan 1999). The basic observation that 
these linear PnP (perspective n-point) algorithms employ is that the visual angle between any 
pair of 2D points $\hat{x}_i$ and $\hat{x}_j$ must be the same as the angle between their corresponding 3D 
points $p_i$ and $p_j$ (Figure 6.4). 

9 Because $P$ is unknown up to a scale, we can either fix one of the entries, e.g., $p_{23} = 1$, or find the smallest 
singular vector of the set of linear equations. 

10 Note the unfortunate clash of terminologies: In matrix algebra textbooks, $R$ represents an upper-triangular 
matrix; in computer vision, $R$ is an orthogonal rotation. 

Figure 6.4 Pose estimation by the direct linear transform and by measuring visual angles 
and distances between pairs of points. 

Given a set of corresponding 2D and 3D points $\{(\hat{x}_i, p_i)\}$, where the $\hat{x}_i$ are unit directions 
obtained by transforming 2D pixel measurements $x_i$ to unit norm 3D directions $\hat{x}_i$ through 
the inverse calibration matrix $K^{-1}$, 

$$\hat{x}_i = \mathcal{N}(K^{-1}x_i) = K^{-1}x_i / \|K^{-1}x_i\|, \qquad (6.36)$$

the unknowns are the distances $d_i$ from the camera origin $c$ to the 3D points $p_i$, where 

$$p_i = d_i\hat{x}_i + c \qquad (6.37)$$

(Figure 6.4). The cosine law for triangle $\Delta(c, p_i, p_j)$ gives us 

$$f_{ij}(d_i, d_j) = d_i^2 + d_j^2 - 2d_id_jc_{ij} - d_{ij}^2 = 0, \qquad (6.38)$$

where 

$$c_{ij} = \cos\theta_{ij} = \hat{x}_i \cdot \hat{x}_j \qquad (6.39)$$

and 

$$d_{ij}^2 = \|p_i - p_j\|^2. \qquad (6.40)$$

We can take any triplet of constraints $(f_{ij}, f_{ik}, f_{jk})$ and eliminate the $d_j$ and $d_k$ using 
Sylvester resultants (Cox, Little, and O’Shea 2007) to obtain a quartic equation in $d_i^2$, 

$$g_{ijk}(d_i^2) = a_4d_i^8 + a_3d_i^6 + a_2d_i^4 + a_1d_i^2 + a_0 = 0. \qquad (6.41)$$

Given five or more correspondences, we can generate $(n-1)(n-2)/2$ triplets (one for each 
pair of the remaining points) to obtain a linear estimate (using SVD) for the values of 
$(d_i^8, d_i^6, d_i^4, d_i^2)$ (Quan and Lan 1999). Estimates for $d_i^2$ can be computed as ratios of 
successive $d_i^{2n+2}/d_i^{2n}$ estimates and these can be averaged to obtain a final estimate of 
$d_i^2$ (and hence $d_i$). 
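
The geometric constraint behind these algorithms, Equations (6.36) and (6.38)-(6.40), can be checked numerically. The Python/NumPy sketch below only evaluates the constraint for given distances; it does not perform the resultant-based elimination, and the function names are assumptions made for illustration.

```python
import numpy as np

def unit_directions(K, pixels):
    """x_hat_i = N(K^{-1} x_i) for (N, 2) pixel coordinates, as in (6.36)."""
    homog = np.hstack([pixels, np.ones((len(pixels), 1))])
    rays = np.linalg.solve(K, homog.T).T              # K^{-1} x_i for every point
    return rays / np.linalg.norm(rays, axis=1, keepdims=True)

def cosine_law_residual(d_i, d_j, xh_i, xh_j, p_i, p_j):
    """f_ij(d_i, d_j) from (6.38); zero when both distances are correct."""
    c_ij = xh_i @ xh_j                                # cos(theta_ij), (6.39)
    d_ij_sq = np.sum((p_i - p_j) ** 2)                # squared 3D distance, (6.40)
    return d_i ** 2 + d_j ** 2 - 2.0 * d_i * d_j * c_ij - d_ij_sq
```

Since the angle between two viewing rays does not depend on the camera's orientation, the residual vanishes for the true distances regardless of the unknown rotation.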

Once the individual estimates of the $d_i$ distances have been computed, we can generate 
a 3D structure consisting of the scaled point directions $d_i\hat{x}_i$, which can then be aligned with 
the 3D point cloud $\{p_i\}$ using absolute orientation (Section 6.1.5) to obtain the desired 
pose estimate. Quan and Lan (1999) give accuracy results for this and other techniques, 
which use fewer points but require more complicated algebraic manipulations. The paper by 
Moreno-Noguer, Lepetit, and Fua (2007) reviews more recent alternatives and also gives a 
lower complexity algorithm that typically produces more accurate results. 

Unfortunately, because minimal PnP solutions can be quite noise sensitive and also suffer 
from bas-relief ambiguities (e.g., depth reversals) (Section 7.4.3), it is often preferable to use 
the linear six-point algorithm to guess an initial pose and then optimize this estimate using 
the iterative technique described in Section 6.2.2. 

An alternative pose estimation algorithm involves starting with a scaled orthographic pro- 
jection model and then iteratively refining this initial estimate using a more accurate perspec- 
tive projection model (DeMenthon and Davis 1995). The attraction of this model, as stated 
in the paper’s title, is that it can be implemented “in 25 lines of [Mathematica] code”. 

6.2.2 Iterative algorithms 

The most accurate (and flexible) way to estimate pose is to directly minimize the squared (or 
robust) reprojection error for the 2D points as a function of the unknown pose parameters in 
$(R, t)$ and optionally $K$ using non-linear least squares (Tsai 1987; Bogart 1991; Gleicher 
and Witkin 1992). We can write the projection equations as 

$$x_i = f(p_i; R, t, K) \qquad (6.42)$$

and iteratively minimize the robustified linearized reprojection errors 

$$E_{\mathrm{NLP}} = \sum_i \rho\left(\left\|\frac{\partial f}{\partial R}\Delta R + \frac{\partial f}{\partial t}\Delta t + \frac{\partial f}{\partial K}\Delta K - r_i\right\|\right), \qquad (6.43)$$

where $r_i = \hat{x}_i - \tilde{x}_i$ is the current residual vector (2D error in predicted position, i.e., the 
measured location minus the current prediction) and the 
partial derivatives are with respect to the unknown pose parameters (rotation, translation, and 
optionally calibration). Note that if full 2D covariance estimates are available for the 2D 
feature locations, the above squared norm can be weighted by the inverse point covariance 
matrix, as in Equation (6.11). 

An easier to understand (and implement) version of the above non-linear regression 
problem can be constructed by re-writing the projection equations as a concatenation of simpler 
steps, each of which transforms a 4D homogeneous coordinate $p_i$ by a simple transformation 
such as translation, rotation, or perspective division (Figure 6.5). The resulting projection 
equations can be written as 

$$y^{(1)} = f_T(p_i; c_j) = p_i - c_j, \qquad (6.44)$$

$$y^{(2)} = f_R(y^{(1)}; q_j) = R(q_j)\,y^{(1)}, \qquad (6.45)$$

$$y^{(3)} = f_P(y^{(2)}) = \frac{y^{(2)}}{z^{(2)}}, \qquad (6.46)$$

$$x_i = f_C(y^{(3)}; k) = K(k)\,y^{(3)}. \qquad (6.47)$$

Figure 6.5 A set of chained transforms for projecting a 3D point $p_i$ to a 2D measurement $x_i$ 
through a series of transformations $f^{(k)}$, each of which is controlled by its own set of 
parameters. The dashed lines indicate the flow of information as partial derivatives are computed 
during a backward pass. 
Note that in these equations, we have indexed the camera centers $c_j$ and camera rotation 
quaternions $q_j$ by an index $j$, in case more than one pose of the calibration object is being 
used (see also Section 7.4). We are also using the camera center $c_j$ instead of the world 
translation $t_j$, since this is a more natural parameter to estimate. 

The advantage of this chained set of transformations is that each one has a simple partial 
derivative with respect both to its parameters and to its input. Thus, once the predicted value 
of $\tilde{x}_i$ has been computed based on the 3D point location $p_i$ and the current values of the pose 
parameters $(c_j, q_j, k)$, we can obtain all of the required partial derivatives using the chain 
rule 

$$\frac{\partial r_i}{\partial p^{(k)}} = \frac{\partial r_i}{\partial y^{(k)}}\frac{\partial y^{(k)}}{\partial p^{(k)}}, \qquad (6.48)$$

where $p^{(k)}$ indicates one of the parameter vectors that is being optimized. (This same “trick” 
is used in neural networks as part of the backpropagation algorithm (Bishop 2006).) 
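
The chained formulation can be illustrated with a small sketch that computes the forward projection (6.44)-(6.47) and then multiplies the per-stage Jacobians to obtain the derivative of the predicted 2D point with respect to the camera center. This is a Python/NumPy illustration under stated assumptions: the rotation is passed in as a matrix `R` rather than derived from a quaternion $q_j$, the function names are hypothetical, and the Jacobian is that of the prediction (the residual's Jacobian differs only in sign).

```python
import numpy as np

def project(p, c, R, K):
    """Chained projection; returns the 2D point and the intermediate values."""
    y1 = p - c                  # translation f_T, (6.44)
    y2 = R @ y1                 # rotation f_R, (6.45), with R given directly
    y3 = y2 / y2[2]             # perspective division f_P, (6.46)
    x = (K @ y3)[:2]            # calibration f_C, (6.47)
    return x, (y1, y2, y3)

def dproject_dc(p, c, R, K):
    """Jacobian (2 x 3) of the projected point with respect to the center c."""
    _, (_, y2, _) = project(p, c, R, K)
    X, Y, Z = y2
    dy3_dy2 = np.array([[1 / Z, 0.0, -X / Z**2],     # derivative of y2 / z
                        [0.0, 1 / Z, -Y / Z**2],
                        [0.0, 0.0, 0.0]])
    dx_dy3 = K[:2, :]           # calibration stage is linear in y3
    dy2_dy1 = R                 # rotation stage is linear in y1
    dy1_dc = -np.eye(3)         # y1 = p - c
    return dx_dy3 @ dy3_dy2 @ dy2_dy1 @ dy1_dc       # chain rule, as in (6.48)
```

Comparing the result against central finite differences of `project` is a convenient way to validate each stage's derivative before using it inside an optimizer.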

The one special case in this formulation that can be considerably simplified is the 
computation of the rotation update. Instead of directly computing the derivatives of the 3 × 3 rotation 
matrix $R(q)$ as a function of the unit quaternion entries, you can prepend the incremental 
rotation matrix $\Delta R(\omega)$ given in Equation (2.35) to the current rotation matrix and compute the 
partial derivative of the transform with respect to these parameters, which results in a simple 
cross product of the backward chaining partial derivative and the outgoing 3D vector (2.36). 

Figure 6.6 The VideoMouse can sense six degrees of freedom relative to a specially printed 
mouse pad using its embedded camera (Hinckley, Sinclair, Hanson et al. 1999) © 1999 
ACM: (a) top view of the mouse; (b) view of the mouse showing the curved base for rocking; 
(c) moving the mouse pad with the other hand extends the interaction capabilities; (d) the 
resulting movement seen on the screen. 

6.2.3 Application: Augmented reality 

A widely used application of pose estimation is augmented reality, where virtual 3D images 
or annotations are superimposed on top of a live video feed, either through the use of see- 
through glasses (a head-mounted display) or on a regular computer or mobile device screen 
(Azuma, Baillot, Behringer et al. 2001; Haller, Billinghurst, and Thomas 2007). In some 
applications, a special pattern printed on cards or in a book is tracked to perform the aug- 
mentation (Kato, Billinghurst, Poupyrev et al. 2000; Billinghurst, Kato, and Poupyrev 2001). 
For a desktop application, a grid of dots printed on a mouse pad can be tracked by a camera 
embedded in an augmented mouse to give the user control of a full six degrees of freedom 
over their position and orientation in a 3D space (Hinckley, Sinclair, Hanson et al. 1999), as 
shown in Figure 6.6. 

Sometimes, the scene itself provides a convenient object to track, such as the rectangle 
defining a desktop used in through-the-lens camera control (Gleicher and Witkin 1992). In 
outdoor locations, such as film sets, it is more common to place special markers such as 
brightly colored balls in the scene to make it easier to find and track them (Bogart 1991). In 
older applications, surveying techniques were used to determine the locations of these balls 
before filming. Today, it is more common to apply structure-from-motion directly to the film 
footage itself (Section 7.4.2). 

Rapid pose estimation is also central to tracking the position and orientation of the 
hand-held remote controls used in Nintendo’s Wii game systems. A high-speed camera embedded 
in the remote control is used to track the locations of the infrared (IR) LEDs in the bar that 
is mounted on the TV monitor. Pose estimation is then used to infer the remote control’s 
location and orientation at very high frame rates. The Wii system can be extended to a variety 
of other user interaction applications by mounting the bar on a hand-held device, as described 
by Johnny Lee. 11 

Exercises 6.4 and 6.5 have you implement two different tracking and pose estimation sys- 
tems for augmented-reality applications. The first system tracks the outline of a rectangular 
object, such as a book cover or magazine page, and the second has you track the pose of a 
hand-held Rubik’s cube. 


6.3 Geometric intrinsic calibration 

As described above in Equations (6.42-6.43), the computation of the internal (intrinsic) cam- 
era calibration parameters can occur simultaneously with the estimation of the (extrinsic) 
pose of the camera with respect to a known calibration target. This, indeed, is the “classic” 
approach to camera calibration used in both the photogrammetry (Slama 1980) and the com- 
puter vision (Tsai 1987) communities. In this section, we look at alternative formulations 
(which may not involve the full solution of a non-linear regression problem), the use of alter- 
native calibration targets, and the estimation of the non-linear part of camera optics such as 
radial distortion. 12 


6.3.1 Calibration patterns 

The use of a calibration pattern or set of markers is one of the more reliable ways to estimate 
a camera’s intrinsic parameters. In photogrammetry, it is common to set up a camera in a 
large field looking at distant calibration targets whose exact location has been precomputed 
using surveying equipment (Slama 1980; Atkinson 1996; Kraus 1997). In this case, the trans- 
lational component of the pose becomes irrelevant and only the camera rotation and intrinsic 
parameters need to be recovered. 

If a smaller calibration rig needs to be used, e.g., for indoor robotics applications or for 
mobile robots that carry their own calibration target, it is best if the calibration object can span 
as much of the workspace as possible (Figure 6.8a), as planar targets often fail to accurately 
predict the components of the pose that lie far away from the plane. A good way to determine 
if the calibration has been successfully performed is to estimate the covariance in the param- 
eters (Section 6.1.4) and then project 3D points from various points in the workspace into the 
image in order to estimate their 2D positional uncertainty. 

11 http://johnnylee.net/projects/wii/. 

12 In some applications, you can use the EXIF tags associated with a JPEG image to obtain a rough estimate of a 
camera’s focal length but this technique should be used with caution as the results are often inaccurate. 

Figure 6.7 Calibrating a lens by drawing straight lines on cardboard (Debevec, Wenger, 
Tchou et al. 2002) © 2002 ACM: (a) an image taken by the video camera showing a hand 
holding a metal ruler whose right edge appears vertical in the image; (b) the set of lines drawn 
on the cardboard converging on the front nodal point (center of projection) of the lens and 
indicating the horizontal field of view. 


An alternative method for estimating the focal length and center of projection of a lens 
is to place the camera on a large flat piece of cardboard and use a long metal ruler to draw 
lines on the cardboard that appear vertical in the image, as shown in Figure 6.7a (Debevec, 
Wenger, Tchou et al. 2002). Such lines lie on planes that are parallel to the vertical axis of 
the camera sensor and also pass through the lens’ front nodal point. The location of the nodal 
point (projected vertically onto the cardboard plane) and the horizontal field of view (deter- 
mined from lines that graze the left and right edges of the visible image) can be recovered by 
intersecting these lines and measuring their angular extent (Figure 6.7b). 
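
As a rough illustration of this construction, the sketch below intersects the two lines that graze the left and right image edges (given as point pairs measured on the cardboard, in any consistent unit) to locate the projected nodal point and the horizontal field of view. The function names are hypothetical, and the final conversion to a focal length in pixels assumes an ideal pinhole camera whose optical center lies at the middle of the image.

```python
import math


def intersect_lines(p1, p2, q1, q2):
    """Intersect the 2D line through p1, p2 with the line through q1, q2."""
    d1 = (p2[0] - p1[0], p2[1] - p1[1])
    d2 = (q2[0] - q1[0], q2[1] - q1[1])
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-12:
        raise ValueError("lines are (nearly) parallel")
    s = ((q1[0] - p1[0]) * d2[1] - (q1[1] - p1[1]) * d2[0]) / denom
    return (p1[0] + s * d1[0], p1[1] + s * d1[1])


def nodal_point_and_fov(left_line, right_line):
    """Return the projected nodal point and the horizontal field of view (radians).

    Each line is a pair of points measured on the cardboard, both lying in
    front of the camera.
    """
    nodal = intersect_lines(*left_line, *right_line)

    def direction(line):
        # Use the endpoint farther from the nodal point for a stable direction.
        far = max(line, key=lambda p: math.hypot(p[0] - nodal[0], p[1] - nodal[1]))
        return (far[0] - nodal[0], far[1] - nodal[1])

    u, v = direction(left_line), direction(right_line)
    cross = u[0] * v[1] - u[1] * v[0]
    dot = u[0] * v[0] + u[1] * v[1]
    return nodal, abs(math.atan2(cross, dot))


def focal_length_in_pixels(fov, image_width):
    """Pinhole relation f = (W / 2) / tan(fov / 2), assuming a centered optical axis."""
    return 0.5 * image_width / math.tan(0.5 * fov)
```

In practice, several lines would be drawn and intersected in a least squares sense; two lines are used here only to keep the sketch short.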

If no calibration pattern is available, it is also possible to perform calibration simulta- 
neously with structure and pose recovery (Sections 6.3.4 and 7.4), which is known as self- 
calibration (Faugeras, Luong, and Maybank 1992; Hartley and Zisserman 2004; Moons, Van 
Gool, and Vergauwen 2010). However, such an approach requires a large amount of imagery 
to be accurate. 

Planar calibration patterns 

When a finite workspace is being used and accurate machining and motion control platforms 
are available, a good way to perform calibration is to move a planar calibration target in a 
controlled fashion through the workspace volume. This approach is sometimes called the N- 
planes calibration approach (Gremban, Thorpe, and Kanade 1988; Champleboux, Lavallee, 
Szeliski et al. 1992; Grossberg and Nayar 2001) and has the advantage that each camera pixel 
can be mapped to a unique 3D ray in space, which takes care of both linear effects modeled 





Figure 6.8 Calibration patterns: (a) a three-dimensional target (Quan and Lan 1999) © 1999 
IEEE; (b) a two-dimensional target (Zhang 2000) © 2000 IEEE. Note that radial distortion 
needs to be removed from such images before the feature points can be used for calibration. 


by the calibration matrix K and non-linear effects such as radial distortion (Section 6.3.5). 

A less cumbersome but also less accurate calibration can be obtained by waving a pla- 
nar calibration pattern in front of a camera (Figure 6.8b). In this case, the pattern’s pose 
has (in principle) to be recovered in conjunction with the intrinsics. In this technique, each 
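input image is used to compute a separate homography (6.19-6.23) $\tilde{\mathbf{H}}$ mapping the plane’s 
calibration points $(X_i, Y_i, 0)$ into image coordinates $(x_i, y_i)$, 

$$\tilde{\mathbf{x}}_i = \begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix} \sim \mathbf{K} \begin{bmatrix} \mathbf{r}_0 & \mathbf{r}_1 & \mathbf{t} \end{bmatrix} \begin{bmatrix} X_i \\ Y_i \\ 1 \end{bmatrix}, \tag{6.49}$$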


where the $\mathbf{r}_i$ are the first two columns of $\mathbf{R}$ and $\sim$ indicates equality up to scale. From 
these, Zhang (2000) shows how to form linear constraints on the nine entries in the $\mathbf{B} = 
\mathbf{K}^{-T} \mathbf{K}^{-1}$ matrix, from which the calibration matrix $\mathbf{K}$ can be recovered using a matrix 
square root and inversion. (The matrix B is known as the image of the absolute conic (IAC) 
in projective geometry and is commonly used for camera calibration (Hartley and Zisserman 
2004, Section 7.5).) If only the focal length is being recovered, the even simpler approach of 
using vanishing points can be used instead. 
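
To make the form of these constraints concrete, here is a minimal sketch of a restricted version of the approach. It assumes (and this is a simplification, not Zhang's full method) that the optical center is at the pixel origin and that there is no skew, so that K = diag(fx, fy, 1) and B = diag(1/fx², 1/fy², 1). Each homography with columns h0, h1 then contributes the two equations h0ᵀ B h1 = 0 and h0ᵀ B h0 = h1ᵀ B h1, which are linear in 1/fx² and 1/fy². The function names are hypothetical, and `plane_homography` only generates synthetic test data.

```python
import math


def focal_lengths_from_plane_homographies(homographies):
    """Least-squares (fx, fy) from homographies H ~ K [r0 r1 t].

    Assumes K = diag(fx, fy, 1), so B = K^-T K^-1 = diag(1/fx^2, 1/fy^2, 1).
    The unknowns are a = 1/fx^2 and b = 1/fy^2.
    """
    saa = sab = sbb = ra = rb = 0.0  # accumulated 2x2 normal equations
    for H in homographies:
        (h00, h01, _), (h10, h11, _), (h20, h21, _) = H
        rows = (
            # h0^T B h1 = 0
            (h00 * h01, h10 * h11, -h20 * h21),
            # h0^T B h0 - h1^T B h1 = 0
            (h00**2 - h01**2, h10**2 - h11**2, h21**2 - h20**2),
        )
        for ca, cb, rhs in rows:
            saa += ca * ca
            sab += ca * cb
            sbb += cb * cb
            ra += ca * rhs
            rb += cb * rhs
    det = saa * sbb - sab * sab
    if det <= 1e-12 * saa * sbb:
        raise ValueError("degenerate views: tilt the pattern more")
    a = (ra * sbb - rb * sab) / det
    b = (saa * rb - sab * ra) / det
    if a <= 0 or b <= 0:
        raise ValueError("homographies are inconsistent with the assumed K")
    return 1.0 / math.sqrt(a), 1.0 / math.sqrt(b)


def plane_homography(fx, fy, R, t, scale=1.0):
    """Synthetic H = scale * K [r0 r1 t] for a calibration plane at Z = 0."""
    k = (fx, fy, 1.0)
    return [
        [scale * k[i] * R[i][0], scale * k[i] * R[i][1], scale * k[i] * t[i]]
        for i in range(3)
    ]
```

Because both constraints are homogeneous of degree two in the entries of H, the unknown scale of each homography does not affect the solution, although it does change the relative weighting of the views in the least squares sum.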


6.3.2 Vanishing points 

A calibration case that occurs often in practice is when the camera is looking at 
a man-made scene with strong extended rectahedral objects such as boxes or room walls. In 
this case, we can intersect the 2D lines corresponding to 3D parallel lines to compute their 
vanishing points, as described in Section 4.3.3, and use these to determine the intrinsic and 
extrinsic calibration parameters (Caprile and Torre 1990; Becker and Bove 1995; Liebowitz 
Figure 6.9 Calibration from vanishing points: (a) any pair of finite vanishing points $(\hat{\mathbf{x}}_i, \hat{\mathbf{x}}_j)$ 
can be used to estimate the focal length; (b) the orthocenter of the vanishing point triangle 
gives the optical center of the image $\mathbf{c}$. 


and Zisserman 1998; Cipolla, Drummond, and Robertson 1999; Antone and Teller 2002; 
Criminisi, Reid, and Zisserman 2000; Hartley and Zisserman 2004; Pflugfelder 2008). 

Let us assume that we have detected two or more orthogonal vanishing points, all of which 
are finite, i.e., they are not obtained from lines that appear to be parallel in the image plane 
(Figure 6.9a). Let us also assume a simplified form for the calibration matrix K where only 
the focal length is unknown (2.59). (It is often safe for rough 3D modeling to assume that 
the optical center is at the center of the image, that the aspect ratio is 1, and that there is no 
skew.) In this case, the projection equation for the vanishing points can be written as 


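$$\hat{\mathbf{x}}_i = \begin{bmatrix} x_i - c_x \\ y_i - c_y \\ f \end{bmatrix} \sim \mathbf{R} \mathbf{p}_i = \mathbf{r}_i, \tag{6.50}$$

where $\mathbf{p}_i$ corresponds to one of the cardinal directions (1, 0, 0), (0, 1, 0), or (0, 0, 1), and $\mathbf{r}_i$ 
is the $i$th column of the rotation matrix $\mathbf{R}$. 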

From the orthogonality between columns of the rotation matrix, we have 


$$\mathbf{r}_i \cdot \mathbf{r}_j \sim (x_i - c_x)(x_j - c_x) + (y_i - c_y)(y_j - c_y) + f^2 = 0 \tag{6.51}$$

from which we can obtain an estimate for $f^2$. Note that the accuracy of this estimate increases 
as the vanishing points move closer to the center of the image. In other words, it is best to tilt 
the calibration pattern a decent amount around the 45° axis, as in Figure 6.9a. Once the focal 
length $f$ has been determined, the individual columns of $\mathbf{R}$ can be estimated by normalizing 
the left hand side of (6.50) and taking cross products. Alternatively, an SVD of the initial $\mathbf{R}$ 
estimate, which is a variant on orthogonal Procrustes (6.32), can be used. 
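
The following minimal sketch applies (6.51) and (6.50) to a pair of finite, orthogonal vanishing points, assuming a known optical center. The function names are hypothetical. Note that each recovered column is only determined up to sign, since a direction and its negative share the same vanishing point.

```python
import math


def focal_from_vanishing_points(vi, vj, center):
    """Solve (6.51) for f given two finite, orthogonal vanishing points."""
    cx, cy = center
    f2 = -((vi[0] - cx) * (vj[0] - cx) + (vi[1] - cy) * (vj[1] - cy))
    if f2 <= 0:
        raise ValueError("vanishing points are inconsistent with orthogonality")
    return math.sqrt(f2)


def rotation_column(v, center, f):
    """Normalize the left-hand side of (6.50); the sign remains ambiguous."""
    d = (v[0] - center[0], v[1] - center[1], f)
    n = math.sqrt(sum(c * c for c in d))
    return tuple(c / n for c in d)


def cross(a, b):
    """3D cross product."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def calibrate_from_two_vanishing_points(vi, vj, center):
    """Return f and three rotation columns (the third from a cross product)."""
    f = focal_from_vanishing_points(vi, vj, center)
    ri = rotation_column(vi, center, f)
    rj = rotation_column(vj, center, f)
    return f, (ri, rj, cross(ri, rj))
```

With noisy vanishing points, the two normalized columns will not be exactly orthogonal, which is where the SVD-based clean-up mentioned above becomes useful.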

If all three vanishing points are visible and finite in the same image, it is also possible to 
estimate the optical center as the orthocenter of the triangle formed by the three vanishing 
points (Caprile and Torre 1990; Hartley and Zisserman 2004, Section 7.6) (Figure 6.9b). 


Figure 6.10 Single view metrology (Criminisi, Reid, and Zisserman 2000) © 2000 
Springer: (a) input image showing the three coordinate axes computed from the two hori- 
zontal vanishing points (which can be determined from the sidings on the shed); (b) a new 
view of the 3D reconstruction. 

In practice, however, it is more accurate to re-estimate any unknown intrinsic calibration 
parameters using non-linear least squares (6.42). 
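
The orthocenter construction can be sketched in a few lines. The code below intersects two altitudes of the vanishing point triangle; the function name is hypothetical, and the result should be regarded as an initial estimate to be refined as just described.

```python
def orthocenter(a, b, c):
    """Orthocenter of the triangle with 2D vertices a, b, c.

    Solves (p - a) . (c - b) = 0 and (p - b) . (c - a) = 0 for p,
    i.e., intersects the altitudes through a and b.
    """
    e1 = (c[0] - b[0], c[1] - b[1])
    e2 = (c[0] - a[0], c[1] - a[1])
    rhs1 = a[0] * e1[0] + a[1] * e1[1]
    rhs2 = b[0] * e2[0] + b[1] * e2[1]
    det = e1[0] * e2[1] - e1[1] * e2[0]
    if abs(det) < 1e-12:
        raise ValueError("vanishing points are collinear")
    px = (rhs1 * e2[1] - e1[1] * rhs2) / det
    py = (e1[0] * rhs2 - e2[0] * rhs1) / det
    return (px, py)
```

For example, the three vanishing points (-880, -960), (20, 840), and (920, -60) are consistent with a focal length of 600 and an optical center of (320, 240), which is the point this function returns for them.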

6.3.3 Application: Single view metrology 

A fun application of vanishing point estimation and camera calibration is the single view 
metrology system developed by Criminisi, Reid, and Zisserman (2000). Their system allows 
people to interactively measure heights and other dimensions as well as to build piecewise- 
planar 3D models, as shown in Figure 6.10. 

The first step in their system is to identify two orthogonal vanishing points on the ground 
plane and the vanishing point for the vertical direction, which can be done by drawing some 
parallel sets of lines in the image. (Alternatively, automated techniques such as those dis- 
cussed in Section 4.3.3 or by Schaffalitzky and Zisserman (2000) could be used.) The user 
then marks a few dimensions in the image, such as the height of a reference object, and 
the system can automatically compute the height of another object. Walls and other planar 
impostors (geometry) can also be sketched and reconstructed. 

In the formulation originally developed by Criminisi, Reid, and Zisserman (2000), the 
system produces an affine reconstruction, i.e., one that is only known up to a set of indepen- 
dent scaling factors along each axis. A potentially more useful system can be constructed by 
assuming that the camera is calibrated up to an unknown focal length, which can be recov- 
ered from orthogonal (finite) vanishing directions, as we just described in Section 6.3.2. Once 
this is done, the user can indicate an origin on the ground plane and another point a known 
distance away. From this, points on the ground plane can be directly projected into 3D and 





Computer Vision: Algorithms and Applications (September 3, 2010 draft) 



Figure 6.11 Four images taken with a hand-held camera registered using a 3D rotation 
motion model, which can be used to estimate the focal length of the camera (Szeliski and 
Shum 1997) © 2000 ACM. 


points above the ground plane, when paired with their ground plane projections, can also be 
recovered. A fully metric reconstruction of the scene then becomes possible. 
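
As a minimal sketch of these two back-projection steps, the code below assumes that the focal length f, the optical center, the rotation R, and the (metrically scaled) translation t are already known, with the camera model x ~ K(RX + t) and the ground plane at Z = 0. All function names are hypothetical; `project` is included only so that the example can be checked on synthetic data.

```python
def project(X, f, center, R, t):
    """Project a 3D world point into the image using x ~ K (R X + t)."""
    xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    return (f * xc[0] / xc[2] + center[0], f * xc[1] / xc[2] + center[1])


def _ray(pixel, f, center, R, t):
    """Camera center and ray direction in world coordinates for a pixel."""
    d_cam = ((pixel[0] - center[0]) / f, (pixel[1] - center[1]) / f, 1.0)
    # World quantities follow from X = R^T (x_cam - t).
    d_world = [sum(R[i][j] * d_cam[i] for i in range(3)) for j in range(3)]
    c_world = [-sum(R[i][j] * t[i] for i in range(3)) for j in range(3)]
    return c_world, d_world


def backproject_to_ground(pixel, f, center, R, t):
    """Intersect the viewing ray of a pixel with the ground plane Z = 0."""
    c, d = _ray(pixel, f, center, R, t)
    if abs(d[2]) < 1e-12:
        raise ValueError("ray is parallel to the ground plane")
    lam = -c[2] / d[2]
    return tuple(c[j] + lam * d[j] for j in range(3))


def height_above_ground(top_pixel, ground_point, f, center, R, t):
    """Height of a point directly above a known ground point.

    The ray through top_pixel is matched (in a least squares sense) to the
    X, Y coordinates of the ground point, and its Z value is returned.
    """
    c, d = _ray(top_pixel, f, center, R, t)
    denom = d[0] * d[0] + d[1] * d[1]
    if denom < 1e-12:
        raise ValueError("ray is parallel to the vertical direction")
    lam = ((ground_point[0] - c[0]) * d[0] + (ground_point[1] - c[1]) * d[1]) / denom
    return c[2] + lam * d[2]
```

A typical use is to click on the foot of an object, call `backproject_to_ground`, then click on its top and call `height_above_ground` with the recovered ground point.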

Exercise 6.9 has you implement such a system and then use it to model some simple 
3D scenes. Section 12.6.1 describes other, potentially multi-view, approaches to architectural 
reconstruction, including an interactive piecewise-planar modeling system that uses vanishing 
points to establish 3D line directions and plane normals (Sinha, Steedly, Szeliski et al. 2008). 


6.3.4 Rotational motion 

When no calibration targets or known structures are available but you can rotate the camera 
around its front nodal point (or, equivalently, work in a large open environment where all ob- 
jects are distant), the camera can be calibrated from a set of overlapping images by assuming 
that it is undergoing pure rotational motion, as shown in Figure 6.11 (Stein 1995; Hartley 
1997b; Hartley, Hayman, de Agapito et al. 2000; de Agapito, Hayman, and Reid 2001; Kang 
and Weiss 1999; Shum and Szeliski 2000; Frahm and Koch 2003). When a full 360° motion 
is used to perform this calibration, a very accurate estimate of the focal length $f$ can be 
obtained, as the accuracy in this estimate is proportional to the total number of pixels in the 
resulting cylindrical panorama (Section 9.1.6) (Stein 1995; Shum and Szeliski 2000). 

To use this technique, we first compute the homographies $\tilde{\mathbf{H}}_{ij}$ between all overlapping 
pairs of images, as explained in Equations (6.19-6.23). Then, we use the observation, first 
made in Equation (2.72) and explored in more detail in Section 9.1.3 (9.5), that each homography 
is related to the inter-camera rotation $\mathbf{R}_{ij}$ through the (unknown) calibration matrices 
$\mathbf{K}_i$ and $\mathbf{K}_j$, 

$$\tilde{\mathbf{H}}_{ij} = \mathbf{K}_i \mathbf{R}_i \mathbf{R}_j^{-1} \mathbf{K}_j^{-1} = \mathbf{K}_i \mathbf{R}_{ij} \mathbf{K}_j^{-1}. \tag{6.52}$$

The simplest way to obtain the calibration is to use the simplified form of the calibration 
matrix (2.59), where we assume that the pixels are square and the optical center lies at 
the center of the image, i.e., $\mathbf{K}_k = \mathrm{diag}(f_k, f_k, 1)$. (We number the pixel coordinates accordingly, 
i.e., place pixel $(x, y) = (0, 0)$ at the center of the image.) We can then rewrite 
Equation (6.52) as 

$$\mathbf{R}_{10} \sim \mathbf{K}_1^{-1} \tilde{\mathbf{H}}_{10} \mathbf{K}_0 \sim \begin{bmatrix} h_{00} & h_{01} & f_0^{-1} h_{02} \\ h_{10} & h_{11} & f_0^{-1} h_{12} \\ f_1 h_{20} & f_1 h_{21} & f_0^{-1} f_1 h_{22} \end{bmatrix}, \tag{6.53}$$

where $h_{ij}$ are the elements of $\tilde{\mathbf{H}}_{10}$. 

Using the orthonormality properties of the rotation matrix $\mathbf{R}_{10}$ and the fact that the right 
hand side of (6.53) is known only up to a scale, we obtain 

$$h_{00}^2 + h_{01}^2 + f_0^{-2} h_{02}^2 = h_{10}^2 + h_{11}^2 + f_0^{-2} h_{12}^2 \tag{6.54}$$

and 

$$h_{00} h_{10} + h_{01} h_{11} + f_0^{-2} h_{02} h_{12} = 0. \tag{6.55}$$

From this, we can compute estimates for $f_0$ of 

$$f_0^2 = \frac{h_{12}^2 - h_{02}^2}{h_{00}^2 + h_{01}^2 - h_{10}^2 - h_{11}^2} \quad \text{if } h_{00}^2 + h_{01}^2 \neq h_{10}^2 + h_{11}^2 \tag{6.56}$$

or 

$$f_0^2 = -\frac{h_{02} h_{12}}{h_{00} h_{10} + h_{01} h_{11}} \quad \text{if } h_{00} h_{10} \neq -h_{01} h_{11}. \tag{6.57}$$

(Note that the equations originally given by Szeliski and Shum (1997) are erroneous; the 
correct equations are given by Shum and Szeliski (2000).) If neither of these conditions 
holds, we can also take the dot products between the first (or second) row and the third one. 
Similar results can be obtained for $f_1$ as well, by analyzing the columns of $\tilde{\mathbf{H}}_{10}$. If the focal 
length is the same for both images, we can take the geometric mean of $f_0$ and $f_1$ as the 
estimated focal length $f = \sqrt{f_1 f_0}$. When multiple estimates of $f$ are available, e.g., from 
different homographies, the median value can be used as the final estimate. 
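
The estimates in (6.56-6.57) and their column-based counterparts for f1 can be sketched as follows. The expressions for f1 are our own derivation from the first two columns of the matrix in (6.53), the relative tolerance `tol` is an arbitrary choice, and the way the per-homography estimates are pooled (median of the candidates, then geometric mean, then median over homographies) is one plausible reading of the procedure above rather than a prescribed recipe. Pixel coordinates are assumed to be centered, as in the text.

```python
import math
from statistics import median


def _positive_roots(candidates, tol=1e-6):
    """Return sqrt(num / den) for ratios that are well conditioned and positive."""
    roots = []
    for num, den, scale in candidates:
        if abs(den) > tol * scale and num / den > 0:
            roots.append(math.sqrt(num / den))
    return roots


def focal_estimates(H):
    """Candidate values of f0 (from rows) and f1 (from columns) of H_10."""
    (h00, h01, h02), (h10, h11, h12), (h20, h21, _) = H
    upper = h00**2 + h01**2 + h10**2 + h11**2
    f0 = _positive_roots([
        (h12**2 - h02**2, h00**2 + h01**2 - h10**2 - h11**2, upper),  # (6.56)
        (-h02 * h12, h00 * h10 + h01 * h11, upper),                   # (6.57)
    ])
    lower = h20**2 + h21**2
    f1 = _positive_roots([
        # Equal norms and orthogonality of the first two columns in (6.53).
        (h01**2 + h11**2 - h00**2 - h10**2, h20**2 - h21**2, lower),
        (-(h00 * h01 + h10 * h11), h20 * h21, lower),
    ])
    return f0, f1


def focal_length_from_homographies(homographies):
    """Pool estimates, assuming a single focal length shared by all images."""
    per_pair = []
    for H in homographies:
        f0, f1 = focal_estimates(H)
        if f0 and f1:
            per_pair.append(math.sqrt(median(f0) * median(f1)))
    if not per_pair:
        raise ValueError("no usable focal length estimates")
    return median(per_pair)
```

Homographies for which all candidate denominators are close to zero (e.g., a pure rotation around the optical axis) contribute no estimate and are skipped.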

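A more general (upper-triangular) estimate of $\mathbf{K}$ can be obtained in the case of a fixed- 
parameter camera $\mathbf{K}_i = \mathbf{K}$ using the technique of Hartley (1997b). Observe from (6.52) 
that $\mathbf{R}_{ij} \sim \mathbf{K}^{-1} \tilde{\mathbf{H}}_{ij} \mathbf{K}$ and $\mathbf{R}_{ij}^{-T} \sim \mathbf{K}^{T} \tilde{\mathbf{H}}_{ij}^{-T} \mathbf{K}^{-T}$. Equating $\mathbf{R}_{ij} = \mathbf{R}_{ij}^{-T}$, we obtain 
$\mathbf{K}^{-1} \tilde{\mathbf{H}}_{ij} \mathbf{K} \sim \mathbf{K}^{T} \tilde{\mathbf{H}}_{ij}^{-T} \mathbf{K}^{-T}$, from which we get 

$$\tilde{\mathbf{H}}_{ij} (\mathbf{K} \mathbf{K}^{T}) \sim (\mathbf{K} \mathbf{K}^{T}) \tilde{\mathbf{H}}_{ij}^{-T}. \tag{6.58}$$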
This provides us with some homogeneous linear constraints on the entries in $\mathbf{A} = \mathbf{K} \mathbf{K}^{T}$, 
which is known as the dual of the image of the absolute conic (Hartley 1997b; Hartley and 
Zisserman 2004). (Recall that when we estimate a homography, we can only recover it up to 
an unknown scale.) Given a sufficient number of independent homography estimates $\tilde{\mathbf{H}}_{ij}$, 
we can recover $\mathbf{A}$ (up to a scale) using either SVD or eigenvalue analysis and then recover 
$\mathbf{K}$ through Cholesky decomposition (Appendix A.1.4). Extensions to the cases of temporally 
varying calibration parameters and non-stationary cameras are discussed by Hartley, Hayman, 
de Agapito et al. (2000) and de Agapito, Hayman, and Reid (2001). 

The quality of the intrinsic camera parameters can be greatly increased by constructing a 
full 360° panorama, since mis-estimating the focal length will result in a gap (or excessive 
overlap) when the first image in the sequence is stitched to itself (Figure 9.5). The resulting 
mis-alignment can be used to improve the estimate of the focal length and to re-adjust the 
rotation estimates, as described in Section 9.1.4. Rotating the camera by 90° around its optical 
axis and re-shooting the panorama is a good way to check for pixel aspect ratio and skew 
problems, as is generating a full hemi-spherical panorama when there is sufficient texture. 

Ultimately, however, the most accurate estimate of the calibration parameters (including 
radial distortion) can be obtained using a full simultaneous non-linear minimization of the 
intrinsic and extrinsic (rotation) parameters, as described in Section 9.2. 

6.3.5 Radial distortion 

When images are taken with wide-angle lenses, it is often necessary to model lens distortions 
such as radial distortion. As discussed in Section 2.1.6, the radial distortion model 
says that coordinates in the observed images are displaced towards (barrel distortion) or 
away from (pincushion distortion) the image center by an amount proportional to their radial 
distance (Figure 2.13a-b). The simplest radial distortion models use low-order polynomials 
(cf. Equation (2.78)), 

$$\begin{aligned} \hat{x} &= x (1 + \kappa_1 r^2 + \kappa_2 r^4) \\ \hat{y} &= y (1 + \kappa_1 r^2 + \kappa_2 r^4), \end{aligned} \tag{6.59}$$

where $r^2 = x^2 + y^2$ and $\kappa_1$ and $\kappa_2$ are called the radial distortion parameters (Brown 1971; 
Slama 1980). 13 
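
The model in (6.59) is straightforward to apply in the forward direction, but it has no closed-form inverse. The sketch below shows the forward mapping together with a simple fixed-point iteration for the inverse, which we assume is adequate for mild distortion (it is not guaranteed to converge for strong distortion). The function names and the iteration count are illustrative choices, and coordinates are assumed to be normalized so that r is of order one.

```python
def distort(x, y, k1, k2=0.0):
    """Apply the polynomial radial distortion model of (6.59)."""
    r2 = x * x + y * y
    s = 1.0 + k1 * r2 + k2 * r2 * r2
    return x * s, y * s


def undistort(xd, yd, k1, k2=0.0, iterations=20):
    """Invert (6.59) by fixed-point iteration, starting from the distorted point."""
    x, y = xd, yd
    for _ in range(iterations):
        r2 = x * x + y * y
        s = 1.0 + k1 * r2 + k2 * r2 * r2
        x, y = xd / s, yd / s
    return x, y
```

For example, with `k1 = -0.2` and `k2 = 0.05`, the point (0.5, 0.3) is pulled towards the center by the forward model, and `undistort` recovers the original coordinates from the result.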

A variety of techniques can be used to estimate the radial distortion parameters for a 
given lens. 14 One of the simplest and most useful is to take an image of a scene with a lot 

13 Sometimes the relationship between $x$ and $\hat{x}$ is expressed the other way around, i.e., using primed (final) 
coordinates on the right-hand side, $x = \hat{x}(1 + \kappa_1 \hat{r}^2 + \kappa_2 \hat{r}^4)$. This is convenient if we map image pixels into 
(warped) rays and then undistort the rays to obtain 3D rays in space, i.e., if we are using inverse warping. 

14 Some of today’s digital cameras are starting to remove radial distortion using software in the camera itself. 
of straight lines, especially lines aligned with and near the edges of the image. The radial 
distortion parameters can then be adjusted until all of the lines in the image are straight, 
which is commonly called the plumb-line method (Brown 1971; Kang 2001; El-Melegy and 
Farag 2003). Exercise 6.10 gives some more details on how to implement such a technique. 
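
A very small version of the plumb-line idea can be sketched as follows. To keep the example short, it makes several simplifying assumptions that are not part of the methods cited above: a single parameter κ1, the inverse form of the model from footnote 13 (so that observed points are corrected directly), edge points that have already been grouped into curves, coordinates normalized to be of order one, and a coarse-to-fine grid search in place of a proper non-linear least squares solver. All names are hypothetical.

```python
import math


def undistort_points(points, k1):
    """Correct observed points with the inverse-form model x = x_hat (1 + k1 r_hat^2)."""
    corrected = []
    for xd, yd in points:
        s = 1.0 + k1 * (xd * xd + yd * yd)
        corrected.append((xd * s, yd * s))
    return corrected


def line_fit_residual(points):
    """Sum of squared perpendicular distances to the best-fit (total least squares) line.

    This equals the smaller eigenvalue of the 2x2 scatter matrix of the points.
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points)
    syy = sum((p[1] - my) ** 2 for p in points)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    return 0.5 * (sxx + syy) - math.sqrt((0.5 * (sxx - syy)) ** 2 + sxy**2)


def straightness_cost(curves, k1):
    """Total line-fit residual over all curves after correction with k1."""
    return sum(line_fit_residual(undistort_points(c, k1)) for c in curves)


def estimate_k1(curves, lo=-0.3, hi=0.3, steps=20, rounds=6):
    """Coarse-to-fine grid search for the k1 that makes the curves straightest."""
    best = 0.0
    for _ in range(rounds):
        step = (hi - lo) / steps
        grid = [lo + i * step for i in range(steps + 1)]
        best = min(grid, key=lambda k: straightness_cost(curves, k))
        lo, hi = best - step, best + step
    return best
```

Lines that pass through the image center stay straight under any radial distortion and therefore carry no information, which is why lines near the image edges are preferred.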

Another approach is to use several overlapping images and to combine the estimation 
of the radial distortion parameters with the image alignment process, i.e., by extending the 
pipeline used for stitching in Section 9.2.1. Sawhney and Kumar (1999) use a hierarchy 
of motion models (translation, affine, projective) in a coarse-to-fine strategy coupled with 
a quadratic radial distortion correction term. They use direct (intensity-based) minimiza- 
tion to compute the alignment. Stein (1997) uses a feature-based approach combined with 
a general 3D motion model (and quadratic radial distortion), which requires more matches 
than a parallax-free rotational panorama but is potentially more general. More recent ap- 
proaches sometimes simultaneously compute both the unknown intrinsic parameters and the 
radial distortion coefficients, which may include higher-order terms or more complex rational 
or non-parametric forms (Claus and Fitzgibbon 2005; Sturm 2005; Thirthala and Pollefeys 
2005; Barreto and Daniilidis 2005; Hartley and Kang 2005; Steele and Jaynes 2006; Tardif, 
Sturm, Trudeau et al. 2009). 

When a known calibration target is being used (Figure 6.8), the radial distortion estima- 
tion can be folded into the estimation of the other intrinsic and extrinsic parameters (Zhang 
2000; Hartley and Kang 2007; Tardif, Sturm, Trudeau et al. 2009). This can be viewed as 
adding another stage to the general non-linear minimization pipeline shown in Figure 6.5 
between the intrinsic parameter multiplication box $f_C$ and the perspective division box $f_P$. 
(See Exercise 6.11 for more details on the case of a planar calibration target.) 

Of course, as discussed in Section 2.1.6, more general models of lens distortion, such as 
fisheye and non-central projection, may sometimes be required. While the parameterization 
of such lenses may be more complicated (Section 2.1.6), the general approaches of either using 
calibration rigs with known 3D positions or performing self-calibration from multiple 
overlapping images of a scene can both be used (Hartley and Kang 2007; Tardif, Sturm, and 
Roy 2007). The same techniques used to calibrate for radial distortion can also be used to 
reduce the amount of chromatic aberration by separately calibrating each color channel and 
then warping the channels to put them back into alignment (Exercise 6.12). 


6.4 Additional reading 

Hartley and Zisserman (2004) provide a wonderful introduction to the topics of feature-based 
alignment and optimal motion estimation, as well as an in-depth discussion of camera cali- 
bration and pose estimation techniques. 
Techniques for robust estimation are discussed in more detail in Appendix B.3 and in 
monographs and review articles on this topic (Huber 1981; Hampel, Ronchetti, Rousseeuw et 
al. 1986; Rousseeuw and Leroy 1987; Black and Rangarajan 1996; Stewart 1999). The most 
commonly used robust initialization technique in computer vision is RANdom SAmple Con- 
sensus (RANSAC) (Fischler and Bolles 1981), which has spawned a series of more efficient 
variants (Nister 2003; Chum and Matas 2005). 

The topic of registering 3D point data sets is called absolute orientation (Horn 1987) and 
3D pose estimation (Lorusso, Eggert, and Fisher 1995). A variety of techniques has been 
developed for simultaneously computing 3D point correspondences and their corresponding 
rigid transformations (Besl and McKay 1992; Zhang 1994; Szeliski and Lavallee 1996; Gold, 
Rangarajan, Lu et al. 1998; David, DeMenthon, Duraiswami et al. 2004; Li and Hartley 2007; 
Enqvist, Josephson, and Kahl 2009). 

Camera calibration was first studied in photogrammetry (Brown 1971; Slama 1980; Atkin- 
son 1996; Kraus 1997) but it has also been widely studied in computer vision (Tsai 1987; 
Gremban, Thorpe, and Kanade 1988; Champleboux, Lavallee, Szeliski et al. 1992; Zhang 
2000; Grossberg and Nayar 2001). Vanishing points observed either from rectahedral cali- 
bration objects or man-made architecture are often used to perform rudimentary calibration 
(Caprile and Torre 1990; Becker and Bove 1995; Liebowitz and Zisserman 1998; Cipolla, 
Drummond, and Robertson 1999; Antone and Teller 2002; Criminisi, Reid, and Zisserman 
2000; Hartley and Zisserman 2004; Pflugfelder 2008). Performing camera calibration without 
using known targets is known as self-calibration and is discussed in textbooks and surveys on 
structure from motion (Faugeras, Luong, and Maybank 1992; Hartley and Zisserman 2004; 
Moons, Van Gool, and Vergauwen 2010). One popular subset of such techniques uses pure 
rotational motion (Stein 1995; Hartley 1997b; Hartley, Hayman, de Agapito et al. 2000; de 
Agapito, Hayman, and Reid 2001; Kang and Weiss 1999; Shum and Szeliski 2000; Frahm 
and Koch 2003). 


6.5 Exercises 

Ex 6.1: Feature-based image alignment for flip-book animations Take a set of photos of 
an action scene or portrait (preferably in motor-drive — continuous shooting — mode) and 
align them to make a composite or flip-book animation. 

1. Extract features and feature descriptors using some of the techniques described in 
Sections 4.1.1-4.1.2. 

2. Match your features using nearest neighbor matching with a nearest neighbor distance 
ratio test (4.18). 
3. Compute an optimal 2D translation and rotation between the first image and all subse- 
quent images, using least squares (Section 6.1.1) with optional RANSAC for robustness 
(Section 6.1.4). 

4. Resample all of the images onto the first image’s coordinate frame (Section 3.6.1) using 
either bilinear or bicubic resampling and optionally crop them to their common area. 

5. Convert the resulting images into an animated GIF (using software available from the 
Web) or optionally implement cross-dissolves to turn them into a “slo-mo” video. 

6. (Optional) Combine this technique with feature-based (Exercise 3.25) morphing. 

Ex 6.2: Panography Create the kind of panograph discussed in Section 6.1.2 and com- 
monly found on the Web. 

1. Take a series of interesting overlapping photos. 

2. Use the feature detector, descriptor, and matcher developed in Exercises 4.1-4.4 (or 
existing software) to match features among the images. 

3. Turn each connected component of matching features into a track , i.e., assign a unique 
index i to each track, discarding any tracks that are inconsistent (contain two different 
features in the same image). 

4. Compute a global translation for each image using Equation (6.12). 

5. Since your matches probably contain errors, turn the above least squares metric into a 
robust metric (6.25) and re-solve your system using iteratively reweighted least squares. 

6. Compute the size of the resulting composite canvas and resample each image into its 
final position on the canvas. (Keeping track of bounding boxes will make this more 
efficient.) 

7. Average all of the images, or choose some kind of ordering and implement translucent 
over compositing (3.8). 

8. (Optional) Extend your parametric motion model to include rotations and scale, i.e., 
the similarity transform given in Table 6.1. Discuss how you could handle the case of 
translations and rotations only (no scale). 

9. (Optional) Write a simple tool to let the user adjust the ordering and opacity, and add 
or remove images. 
10. (Optional) Write down a different least squares problem that involves pairwise match- 
ing of images. Discuss why this might be better or worse than the global matching 
formula given in (6.12). 

Ex 6.3: 2D rigid/Euclidean matching Several alternative approaches are given in Section 6.1.3 
for estimating a 2D rigid (Euclidean) alignment. 

1. Implement the various alternatives and compare their accuracy on synthetic data, i.e., 
random 2D point clouds with noisy feature positions. 

2. One approach is to estimate the translations from the centroids and then estimate ro- 
tation in polar coordinates. Do you need to weight the angles obtained from a polar 
decomposition in some way to get the statistically correct estimate? 

3. How can you modify your techniques to take into account either scalar (6.10) or full 
two-dimensional point covariance weightings (6.11)? Do all of the previously developed 
“shortcuts” still work or does full weighting require iterative optimization? 
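
As a possible starting point for this exercise, the sketch below implements one closed-form alternative for unweighted points: the translation is tied to the centroids and the rotation angle is obtained from the summed dot and cross products of the centered points, which minimizes the sum of squared residuals. It is not necessarily the formulation used in Section 6.1.3, the function name is hypothetical, and the weighted cases asked about in part 3 are deliberately left out.

```python
import math


def fit_rigid_2d(src, dst):
    """Least squares rotation angle and translation such that dst ~ R(theta) src + t."""
    n = len(src)
    if n != len(dst) or n < 2:
        raise ValueError("need at least two point correspondences")
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(p[0] for p in dst) / n
    dy = sum(p[1] for p in dst) / n
    cos_sum = sin_sum = 0.0
    for (x, y), (u, v) in zip(src, dst):
        x, y, u, v = x - sx, y - sy, u - dx, v - dy
        cos_sum += x * u + y * v  # multiplies cos(theta) in the objective
        sin_sum += x * v - y * u  # multiplies sin(theta) in the objective
    theta = math.atan2(sin_sum, cos_sum)
    c, s = math.cos(theta), math.sin(theta)
    # The optimal translation maps the rotated source centroid onto the target centroid.
    t = (dx - (c * sx - s * sy), dy - (s * sx + c * sy))
    return theta, t
```

Comparing this estimate against the other alternatives on noisy synthetic point clouds is a quick way to get started on part 1.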

Ex 6.4: 2D match move/augmented reality Replace a picture in a magazine or a book 
with a different image or video. 

1. With a webcam, take a picture of a magazine or book page. 

2. Outline a figure or picture on the page with a rectangle, i.e., draw over the four sides as 
they appear in the image. 

3. Match features in this area with each new image frame. 

4. Replace the original image with an “advertising” insert, warping the new image with 
the appropriate homography. 

5. Try your approach on a clip from a sporting event (e.g., indoor or outdoor soccer) to 
implement a billboard replacement. 

Ex 6.5: 3D joystick Track a Rubik’s cube to implement a 3D joystick/mouse control. 

1. Get out an old Rubik’s cube (or get one from your parents). 

2. Write a program to detect the center of each colored square. 

3. Group these centers into lines and then find the vanishing points for each face. 

4. Estimate the rotation angle and focal length from the vanishing points. 
5. Estimate the full 3D pose (including translation) by finding one or more 3 × 3 grids and 
recovering the plane’s full equation from this known homography using the technique 
developed by Zhang (2000). 

6. Alternatively, since you already know the rotation, simply estimate the unknown trans- 
lation from the known 3D corner points on the cube and their measured 2D locations 
using either linear or non-linear least squares. 

7. Use the 3D rotation and position to control a VRML or 3D game viewer. 

Ex 6.6: Rotation-based calibration Take an outdoor or indoor sequence from a rotating 
camera with very little parallax and use it to calibrate the focal length of your camera using 
the techniques described in Section 6.3.4 or Sections 9.1.3-9.2.1. 

1. Take out any radial distortion in the images using one of the techniques from 
Exercises 6.10-6.11 or using parameters supplied for a given camera by your instructor. 

2. Detect and match feature points across neighboring frames and chain them into feature 
tracks. 

3. Compute homographies between overlapping frames and use Equations (6.56-6.57) to 
get an estimate of the focal length. 

4. Compute a full 360° panorama and update your focal length estimate to close the gap 
(Section 9.1.4). 

5. (Optional) Perform a complete bundle adjustment in the rotation matrices and focal 
length to obtain the highest quality estimate (Section 9.2.1). 

Ex 6.7: Target-based calibration Use a three-dimensional target to calibrate your camera. 

1. Construct a three-dimensional calibration pattern with known 3D locations. It is not 
easy to get high accuracy unless you use a machine shop, but you can get close using 
heavy plywood and printed patterns. 

2. Find the corners, e.g., using a line finder and intersecting the lines. 

3. Implement one of the iterative calibration and pose estimation algorithms described 
in Tsai (1987); Bogart (1991); Gleicher and Witkin (1992) or the system described in 
Section 6.2.2. 

4. Take many pictures at different distances and orientations relative to the calibration 
target and report on both your re-projection errors and accuracy. (To do the latter, you 
may need to use simulated data.) 
Ex 6.8: Calibration accuracy Compare the three calibration techniques (plane-based, 
rotation-based, and 3D-target-based). 

One approach is to have a different student implement each one and to compare the results. 
Another approach is to use synthetic data, potentially re-using the software you developed 
for Exercise 2.3. The advantage of using synthetic data is that you know the ground truth 
for the calibration and pose parameters, you can easily run lots of experiments, and you can 
synthetically vary the noise in your measurements. 

Here are some possible guidelines for constructing your test sets: 

1. Assume a medium-wide focal length (say, 50° field of view). 

2. For the plane-based technique, generate a 2D grid target and project it at different 
inclinations. 

3. For a 3D target, create an inner cube corner and position it so that it fills most of the 
field of view. 

4. For the rotation technique, scatter points uniformly on a sphere until you get a similar 
number of points as for other techniques. 

Before comparing your techniques, predict which one will be the most accurate (normalize 
your results by the square root of the number of points used). 

Add varying amounts of noise to your measurements and describe the noise sensitivity of 
your various techniques. 

Ex 6.9: Single view metrology Implement a system to measure dimensions and reconstruct 
a 3D model from a single image of a man-made scene using visible vanishing directions (Sec- 
tion 6.3.3) (Criminisi, Reid, and Zisserman 2000). 

1. Find the three orthogonal vanishing points from parallel lines and use them to establish 
the three coordinate axes (rotation matrix R of the camera relative to the scene). If 
two of the vanishing points are finite (not at infinity), use them to compute the focal 
length, assuming a known optical center. Otherwise, find some other way to calibrate 
your camera; you could use some of the techniques described by Schaffalitzky and 
Zisserman (2000). 

2. Click on a ground plane point to establish your origin and click on a point a known 
distance away to establish the scene scale. This lets you compute the translation t 
between the camera and the scene. As an alternative, click on a pair of points, one 
on the ground plane and one above it, and use the known height to establish the scene 
scale. 
3. Write a user interface that lets you click on ground plane points to recover their 3D 
locations. (Hint: you already know the camera matrix, so knowledge of a point’s z 
value is sufficient to recover its 3D location.) Click on pairs of points (one on the 
ground plane, one above it) to measure vertical heights. 

4. Extend your system to let you draw quadrilaterals in the scene that correspond to axis- 
aligned rectangles in the world, using some of the techniques described by Sinha, 
Steedly, Szeliski et al. (2008). Export your 3D rectangles to a VRML or PLY 15 file. 

5. (Optional) Warp the pixels enclosed by the quadrilateral using the correct homography 
to produce a texture map for each planar polygon. 

Ex 6.10: Radial distortion with plumb lines Implement a plumb-line algorithm to deter- 
mine the radial distortion parameters. 

1. Take some images of scenes with lots of straight lines, e.g., hallways in your home or 
office, and try to get some of the lines as close to the edges of the image as possible. 

2. Extract the edges and link them into curves, as described in Section 4.2.2 and Exer- 
cise 4.8. 

3. Fit quadratic or elliptic curves to the linked edges using a generalization of the successive 
line approximation algorithm described in Section 4.3.1 and Exercise 4.11 and 
keep the curves that fit this form well. 

4. For each curved segment, fit a straight line and minimize the perpendicular distance 
between the curve and the line while adjusting the radial distortion parameters. 

5. Alternate between re-fitting the straight line and adjusting the radial distortion param- 
eters until convergence. 

Ex 6.11: Radial distortion with a calibration target Use a grid calibration target to de- 
termine the radial distortion parameters. 

1. Print out a planar calibration target, mount it on a stiff board, and get it to fill your field 
of view. 

2. Detect the squares, lines, or dots in your calibration target. 

3. Estimate the homography mapping the target to the camera from the central portion of 
the image that does not have any radial distortion. 

15 http://meshlab.sf.net. 
4. Predict the positions of the remaining targets and use the differences between the ob- 
served and predicted positions to estimate the radial distortion. 

5. (Optional) Fit a general spline model (for severe distortion) instead of the quartic dis- 
tortion model. 

6. (Optional) Extend your technique to calibrate a fisheye lens. 

Ex 6.12: Chromatic aberration Use the radial distortion estimates for each color channel 
computed in the previous exercise to clean up wide-angle lens images by warping all of the 
channels into alignment. (Optional) Straighten out the images at the same time. 

Can you think of any reasons why this warping strategy may not always work?

Figure 7.1 Structure from motion systems: (a-d) orthographic factorization (Tomasi and
Kanade 1992) © 1992 Springer; (e-f) line matching (Schmid and Zisserman 1997) © 1997
IEEE; (g-k) incremental structure from motion (Snavely, Seitz, and Szeliski 2006); (l) 3D
reconstruction of Trafalgar Square (Snavely, Seitz, and Szeliski 2006); (m) 3D reconstruction
of the Great Wall of China (Snavely, Seitz, and Szeliski 2006); (n) 3D reconstruction of the
Old Town Square, Prague (Snavely, Seitz, and Szeliski 2006) © 2006 ACM.

In the previous chapter, we saw how 2D and 3D point sets could be aligned and how such
alignments could be used to estimate both a camera’s pose and its internal calibration parame- 
ters. In this chapter, we look at the converse problem of estimating the locations of 3D points 
from multiple images given only a sparse set of correspondences between image features. 
While this process often involves simultaneously estimating both 3D geometry (structure) 
and camera pose (motion), it is commonly known as structure from motion (Ullman 1979). 

The topics of projective geometry and structure from motion are extremely rich and 
some excellent textbooks and surveys have been written on them (Faugeras and Luong 2001; 
Hartley and Zisserman 2004; Moons, Van Gool, and Vergauwen 2010). This chapter skips 
over a lot of the richer material available in these books, such as the trifocal tensor and al- 
gebraic techniques for full self-calibration, and concentrates instead on the basics that we 
have found useful in large-scale, image-based reconstruction problems (Snavely, Seitz, and 
Szeliski 2006). 

We begin with a brief discussion of triangulation (Section 7.1), which is the problem of 
estimating a point’s 3D location when it is seen from multiple cameras. Next, we look at the 
two-frame structure from motion problem (Section 7.2), which involves the determination of 
the epipolar geometry between two cameras and which can also be used to recover certain 
information about the camera intrinsics using self-calibration (Section 7.2.2). Section 7.3 
looks at factorization approaches to simultaneously estimating structure and motion from 
large numbers of point tracks using orthographic approximations to the projection model. 
We then develop a more general and useful approach to structure from motion, namely the 
simultaneous bundle adjustment of all the camera and 3D structure parameters (Section 7.4). 
We also look at special cases that arise when there are higher-level structures, such as lines 
and planes, in the scene (Section 7.5). 


7.1 Triangulation 

The problem of determining a point’s 3D position from a set of corresponding image locations 
and known camera positions is known as triangulation. This problem is the converse of the 
pose estimation problem we studied in Section 6.2. 

One of the simplest ways to solve this problem is to find the 3D point $p$ that lies closest to
all of the 3D rays corresponding to the 2D matching feature locations $\{x_j\}$ observed by
cameras $\{P_j = K_j [R_j | t_j]\}$, where $t_j = -R_j c_j$ and $c_j$ is the $j$th camera center
(2.55-2.56). As you can see in Figure 7.2, these rays originate at $c_j$ in a direction
$\hat{v}_j = N(R_j^{-1} K_j^{-1} x_j)$. The nearest point to $p$ on this ray, which we denote as
$q_j$, minimizes the distance

$$\| c_j + d_j \hat{v}_j - p \|^2, \tag{7.1}$$

which has a minimum at $d_j = \hat{v}_j \cdot (p - c_j)$. Hence,

$$q_j = c_j + (\hat{v}_j \hat{v}_j^T)(p - c_j) = c_j + (p - c_j)_\parallel, \tag{7.2}$$

in the notation of Equation (2.29), and the squared distance between $p$ and $q_j$ is

$$r_j^2 = \| (I - \hat{v}_j \hat{v}_j^T)(p - c_j) \|^2 = \| (p - c_j)_\perp \|^2. \tag{7.3}$$

Figure 7.2 3D point triangulation by finding the point $p$ that lies nearest to all of the optical
rays $c_j + d_j \hat{v}_j$.


The optimal value for $p$, which lies closest to all of the rays, can be computed as a regular
least squares problem by summing over all the $r_j^2$ and finding the optimal value of $p$,

$$p = \left[ \sum_j (I - \hat{v}_j \hat{v}_j^T) \right]^{-1} \left[ \sum_j (I - \hat{v}_j \hat{v}_j^T) c_j \right]. \tag{7.4}$$
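
As a minimal sketch of Equation (7.4), assuming NumPy, the two sums can be accumulated one ray
at a time and the resulting 3 × 3 system solved directly. The function name `triangulate_rays`
and the two-camera test data are illustrative assumptions, not part of the original text.

```python
import numpy as np

def triangulate_rays(centers, directions):
    """Least-squares point closest to the rays c_j + d_j v_j (Equation 7.4)."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for c, v in zip(centers, directions):
        v = v / np.linalg.norm(v)        # unit ray direction v_j
        M = np.eye(3) - np.outer(v, v)   # removes the component along v_j
        A += M                           # sum_j (I - v_j v_j^T)
        b += M @ c                       # sum_j (I - v_j v_j^T) c_j
    return np.linalg.solve(A, b)

# Illustrative data: two rays that both pass through p_true.
p_true = np.array([1.0, 2.0, 5.0])
centers = [np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])]
directions = [p_true - c for c in centers]
p = triangulate_rays(centers, directions)
```

The system becomes singular when all rays are parallel, which is one motivation for the
homogeneous formulation discussed next.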


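An alternative formulation, which is more statistically optimal and which can produce
significantly better estimates if some of the cameras are closer to the 3D point than others, is
to minimize the residual in the measurement equations

$$x_j = \frac{p_{00}^{(j)} X + p_{01}^{(j)} Y + p_{02}^{(j)} Z + p_{03}^{(j)} W}{p_{20}^{(j)} X + p_{21}^{(j)} Y + p_{22}^{(j)} Z + p_{23}^{(j)} W} \tag{7.5}$$

$$y_j = \frac{p_{10}^{(j)} X + p_{11}^{(j)} Y + p_{12}^{(j)} Z + p_{13}^{(j)} W}{p_{20}^{(j)} X + p_{21}^{(j)} Y + p_{22}^{(j)} Z + p_{23}^{(j)} W}, \tag{7.6}$$

where $(x_j, y_j)$ are the measured 2D feature locations and $\{p_{00}^{(j)} \ldots p_{23}^{(j)}\}$ are
the known entries in camera matrix $P_j$ (Sutherland 1974).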

As with Equations (6.21, 6.33, and 6.34), this set of non-linear equations can be converted
into a linear least squares problem by multiplying both sides by the denominator. Note that if
we use homogeneous coordinates $p = (X, Y, Z, W)$, the resulting set of equations is
homogeneous and is best solved as a singular value decomposition (SVD) or eigenvalue problem
(looking for the smallest singular vector or eigenvector). If we set $W = 1$, we can use regular
linear least squares, but the resulting system may be singular or poorly conditioned, e.g., if all
of the viewing rays are parallel, as occurs for points far away from the camera.
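
The linearized, homogeneous version of (7.5-7.6) can be sketched as follows, assuming NumPy.
Each observation contributes the two rows obtained by multiplying through by the denominator,
and the solution is the right singular vector with the smallest singular value. The names
`triangulate_dlt` and `project`, and the two toy cameras, are assumptions made for this
illustration.

```python
import numpy as np

def triangulate_dlt(cameras, points_2d):
    """Homogeneous linear triangulation from 3x4 camera matrices and (x, y) observations."""
    rows = []
    for P, (x, y) in zip(cameras, points_2d):
        rows.append(x * P[2] - P[0])   # (7.5) multiplied by its denominator
        rows.append(y * P[2] - P[1])   # (7.6) multiplied by its denominator
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    return Vt[-1]                      # (X, Y, Z, W), defined only up to scale

def project(P, p):
    q = P @ p
    return q[:2] / q[2]

# Illustrative setup: two identity-calibrated cameras, one unit apart along x.
P0 = np.hstack([np.eye(3), np.zeros((3, 1))])
P1 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
p_true = np.array([0.5, -0.2, 4.0, 1.0])
observations = [project(P0, p_true), project(P1, p_true)]

p_est = triangulate_dlt([P0, P1], observations)
p_est = p_est / p_est[3]               # safe here because the point is not at infinity
```

Keeping the result homogeneous (skipping the final division) is what allows points at or near
infinity to be represented with $W \approx 0$.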

For this reason, it is generally preferable to parameterize 3D points using homogeneous
coordinates, especially if we know that there are likely to be points at greatly varying
distances from the cameras. Of course, minimizing the residuals of the measurement equations
(7.5-7.6) using non-linear least squares, as described in (6.14 and 6.23), is preferable to
using linear least squares, regardless of the representation chosen.

For the case of two observations, it turns out that the location of the point $p$ that exactly
minimizes the true reprojection error (7.5-7.6) can be computed using the solution of degree
six equations (Hartley and Sturm 1997). Another problem to watch out for with triangulation
is the issue of chirality, i.e., ensuring that the reconstructed points lie in front of all the
cameras (Hartley 1998). While this cannot always be guaranteed, a useful heuristic is to take
the points that lie behind the cameras because their rays are diverging (imagine Figure 7.2
where the rays were pointing away from each other) and to place them on the plane at infinity
by setting their $W$ values to 0.


7.2 Two-frame structure from motion

So far in our study of 3D reconstruction, we have always assumed that either the 3D point
positions or the 3D camera poses are known in advance. In this section, we take our first look
at structure from motion, which is the simultaneous recovery of 3D structure and pose from
image correspondences.

Consider Figure 7.3, which shows a 3D point $p$ being viewed from two cameras whose
relative position can be encoded by a rotation $R$ and a translation $t$. Since we do not know
anything about the camera positions, without loss of generality, we can set the first camera at
the origin $c_0 = 0$ and at a canonical orientation $R_0 = I$.

Now notice that the observed location of point $p$ in the first image, $p_0 = d_0 \hat{x}_0$, is
mapped into the second image by the transformation

$$d_1 \hat{x}_1 = p_1 = R p_0 + t = R(d_0 \hat{x}_0) + t, \tag{7.7}$$

where $\hat{x}_j = K_j^{-1} x_j$ are the (local) ray direction vectors. Taking the cross product of
both sides with $t$ in order to annihilate it on the right hand side yields 1

$$d_1 [t]_\times \hat{x}_1 = d_0 [t]_\times R \hat{x}_0. \tag{7.8}$$

1 The cross-product operator $[\,]_\times$ was introduced in (2.32).

Figure 7.3 Epipolar geometry: The vectors $t = c_1 - c_0$, $p - c_0$, and $p - c_1$ are co-planar
and define the basic epipolar constraint expressed in terms of the pixel measurements $x_0$ and
$x_1$.


Taking the dot product of both sides with $\hat{x}_1$ yields

$$d_0 \hat{x}_1^T ([t]_\times R) \hat{x}_0 = d_1 \hat{x}_1^T [t]_\times \hat{x}_1 = 0, \tag{7.9}$$

since the right hand side is a triple product with two identical entries. (Another way to say
this is that the cross product matrix $[t]_\times$ is skew symmetric and returns 0 when pre- and
post-multiplied by the same vector.)

We therefore arrive at the basic epipolar constraint

$$\hat{x}_1^T E \hat{x}_0 = 0, \tag{7.10}$$

where

$$E = [t]_\times R \tag{7.11}$$

is called the essential matrix (Longuet-Higgins 1981).

An alternative way to derive the epipolar constraint is to notice that in order for the
cameras to be oriented so that the rays $\hat{x}_0$ and $\hat{x}_1$ intersect in 3D at point $p$,
the vector connecting the two camera centers $c_1 - c_0 = -R_1^{-1} t$ and the rays corresponding
to pixels $x_0$ and $x_1$, namely $R_j^{-1} \hat{x}_j$, must be co-planar. This requires that the
triple product

$$(\hat{x}_0, R^{-1} \hat{x}_1, -R^{-1} t) = (R \hat{x}_0, \hat{x}_1, -t) = \hat{x}_1 \cdot (t \times R \hat{x}_0) = \hat{x}_1^T ([t]_\times R) \hat{x}_0 = 0. \tag{7.12}$$

Notice that the essential matrix $E$ maps a point $\hat{x}_0$ in image 0 into a line
$l_1 = E \hat{x}_0$ in image 1, since $\hat{x}_1^T l_1 = 0$ (Figure 7.3). All such lines must pass
through the second epipole $e_1$, which is therefore defined as the left singular vector of $E$
with a 0 singular value, or, equivalently, the projection of the vector $t$ into image 1. The
dual (transpose) of these relationships gives us the epipolar line in the first image as
$l_0 = E^T \hat{x}_1$ and $e_0$ as the zero-value right singular vector of $E$.

Given this fundamental relationship (7.10), how can we use it to recover the camera
motion encoded in the essential matrix $E$? If we have $N$ corresponding measurements
$\{(x_{i0}, x_{i1})\}$, we can form $N$ homogeneous equations in the nine elements of
$E = \{e_{00} \ldots e_{22}\}$,

$$\begin{aligned}
x_{i0} x_{i1} e_{00} + y_{i0} x_{i1} e_{01} + x_{i1} e_{02} \; + & \\
x_{i0} y_{i1} e_{10} + y_{i0} y_{i1} e_{11} + y_{i1} e_{12} \; + & \\
x_{i0} e_{20} + y_{i0} e_{21} + e_{22} &= 0
\end{aligned} \tag{7.13}$$

where $x_{ij} = (x_{ij}, y_{ij}, 1)$. This can be written more compactly as

$$[x_{i1} x_{i0}^T] \otimes E = Z_i \otimes E = z_i \cdot f = 0, \tag{7.14}$$

where $\otimes$ indicates an element-wise multiplication and summation of matrix elements, and
$z_i$ and $f$ are the rasterized (vector) forms of the $Z_i = \hat{x}_{i1} \hat{x}_{i0}^T$ and $E$
matrices. 2 Given $N \geq 8$ such equations, we can compute an estimate (up to scale) for the
entries in $E$ using an SVD.
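
The linear system (7.13-7.14) can be sketched in a few lines, assuming NumPy. Each row of the
design matrix is the rasterized outer product $z_i$, and $f$ is the right singular vector with
the smallest singular value. The helper names `essential_eight_point` and `skew` and the
synthetic noise-free scene are assumptions made for illustration; this sketch omits the
normalization discussed next.

```python
import numpy as np

def essential_eight_point(x0, x1):
    """Estimate E (up to scale) from N >= 8 ray directions x0, x1 of shape (N, 3)."""
    # Row i is the rasterized Z_i = x_i1 x_i0^T, so that z_i . f = x_i1^T E x_i0.
    A = np.einsum('ni,nj->nij', x1, x0).reshape(len(x0), 9)
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 3)

def skew(t):
    """Cross-product matrix [t]_x."""
    return np.array([[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]])

# Synthetic noise-free scene: 12 points in front of both cameras.
rng = np.random.default_rng(0)
p0 = rng.uniform([-1, -1, 4], [1, 1, 8], size=(12, 3))
a = 0.1
R = np.array([[np.cos(a), 0, np.sin(a)], [0, 1, 0], [-np.sin(a), 0, np.cos(a)]])
t = np.array([1.0, 0.2, 0.1])
p1 = p0 @ R.T + t                  # p_1 = R p_0 + t, Equation (7.7)
x0 = p0 / p0[:, 2:]                # ray directions (x, y, 1)
x1 = p1 / p1[:, 2:]

E = essential_eight_point(x0, x1)
E_true = skew(t) @ R               # Equation (7.11)
residuals = np.einsum('ni,ij,nj->n', x1, E, x0)
```

With noisy measurements the recovered matrix is generally full rank, so the singularity
constraint has to be imposed afterwards.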

In the presence of noisy measurements, how close is this estimate to being statistically
optimal? If you look at the entries in (7.13), you can see that some entries are the products
of image measurements such as $x_{i0} y_{i1}$ and others are direct image measurements (or even
the identity). If the measurements have comparable noise, the terms that are products of
measurements have their noise amplified by the other element in the product, which can lead
to very poor scaling, e.g., an inordinately large influence of points with large coordinates (far
away from the image center).

In order to counteract this trend, Hartley (1997a) suggests that the point coordinates
should be translated and scaled so that their centroid lies at the origin and their variance
is unity, i.e.,

$$\tilde{x}_i = s (x_i - \mu_x) \tag{7.15}$$

$$\tilde{y}_i = s (y_i - \mu_y) \tag{7.16}$$

such that $\sum_i \tilde{x}_i = \sum_i \tilde{y}_i = 0$ and
$\sum_i \tilde{x}_i^2 + \sum_i \tilde{y}_i^2 = 2n$, where $n$ is the number of points. 3

Once the essential matrix $\tilde{E}$ has been computed from the transformed coordinates
$\{(\tilde{x}_{i0}, \tilde{x}_{i1})\}$, where $\tilde{x}_{ij} = T_j \hat{x}_{ij}$, the original
essential matrix $E$ can be recovered as

$$E = T_1^T \tilde{E} T_0. \tag{7.17}$$

2 We use $f$ instead of $e$ to denote the rasterized form of $E$ to avoid confusion with the
epipoles $e_j$.

3 More precisely, Hartley (1997a) suggests scaling the points “so that the average distance from
the origin is equal to $\sqrt{2}$” but the heuristic of unit variance is faster to compute (does
not require per-point square roots) and should yield comparable improvements.

In his paper, Hartley (1997a) compares the improvement due to his re-normalization strategy
to alternative distance measures proposed by others such as Zhang (1998a,b) and concludes
that his simple re-normalization in most cases is as effective as (or better than) alternative
techniques. Torr and Fitzgibbon (2004) recommend a variant on this algorithm where the
norm of the upper $2 \times 2$ sub-matrix of $E$ is set to 1 and show that it has even better
stability with respect to 2D coordinate transformations.
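
A minimal sketch of the unit-variance normalization (7.15-7.16), assuming NumPy, is given
below. The helper name `normalizing_transform` is hypothetical, and the random matrix
`E_tilde` is only a stand-in for an estimate computed in normalized coordinates. The sketch
checks that un-normalizing with the transposed transform of the second image leaves every
epipolar residual unchanged.

```python
import numpy as np

def normalizing_transform(xy):
    """3x3 similarity T such that T (x, y, 1) has zero mean and sum(x^2 + y^2) = 2n."""
    mu = xy.mean(axis=0)
    s = np.sqrt(2.0 * len(xy) / np.sum((xy - mu) ** 2))
    return np.array([[s, 0.0, -s * mu[0]],
                     [0.0, s, -s * mu[1]],
                     [0.0, 0.0, 1.0]])

rng = np.random.default_rng(1)
n = 20
xy0 = rng.uniform(0, 640, size=(n, 2))     # illustrative pixel coordinates, image 0
xy1 = rng.uniform(0, 640, size=(n, 2))     # illustrative pixel coordinates, image 1
T0, T1 = normalizing_transform(xy0), normalizing_transform(xy1)

h0 = np.column_stack([xy0, np.ones(n)])    # homogeneous coordinates
h1 = np.column_stack([xy1, np.ones(n)])
n0, n1 = h0 @ T0.T, h1 @ T1.T              # normalized coordinates T_j x_ij

E_tilde = rng.normal(size=(3, 3))          # stand-in for an estimate in normalized coordinates
E = T1.T @ E_tilde @ T0                    # map the estimate back to original coordinates

res_normalized = np.einsum('ni,ij,nj->n', n1, E_tilde, n0)
res_original = np.einsum('ni,ij,nj->n', h1, E, h0)
```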

Once an estimate for the essential matrix E has been recovered, the direction of the trans- 
lation vector t can be estimated. Note that the absolute distance between the two cameras can 
never be recovered from pure image measurements alone, regardless of how many cameras 
or points are used. Knowledge about absolute camera and point positions or distances, of- 
ten called ground control points in photogrammetry, is always required to establish the final 
scale, position, and orientation. 

To estimate this direction $\hat{t}$, observe that under ideal noise-free conditions, the essential
matrix $E$ is singular, i.e., $\hat{t}^T E = 0$. This singularity shows up as a singular value of 0
when an SVD of $E$ is performed,

$$E = [\hat{t}]_\times R = U \Sigma V^T = \begin{bmatrix} u_0 & u_1 & \hat{t} \end{bmatrix} \begin{bmatrix} 1 & & \\ & 1 & \\ & & 0 \end{bmatrix} \begin{bmatrix} v_0^T \\ v_1^T \\ v_2^T \end{bmatrix}. \tag{7.18}$$

When $E$ is computed from noisy measurements, the singular vector associated with the
smallest singular value gives us $\hat{t}$. (The other two singular values should be similar but
are not, in general, equal to 1 because $E$ is only computed up to an unknown scale.)

Because $E$ is rank-deficient, it turns out that we actually only need seven correspondences
of the form of Equation (7.14) instead of eight to estimate this matrix (Hartley 1994a; Torr and
Murray 1997; Hartley and Zisserman 2004). (The advantage of using fewer correspondences
inside a RANSAC robust fitting stage is that fewer random samples need to be generated.)
From this set of seven homogeneous equations (which we can stack into a $7 \times 9$ matrix for
SVD analysis), we can find two independent vectors, say $f_0$ and $f_1$, such that
$z_i \cdot f_j = 0$. These two vectors can be converted back into $3 \times 3$ matrices $E_0$ and
$E_1$, which span the solution space for

$$E = \alpha E_0 + (1 - \alpha) E_1. \tag{7.19}$$

To find the correct value of $\alpha$, we observe that $E$ has a zero determinant, since it is rank
deficient, and hence

$$\det | \alpha E_0 + (1 - \alpha) E_1 | = 0. \tag{7.20}$$

This gives us a cubic equation in $\alpha$, which has either one or three solutions (roots).
Substituting these values into (7.19) to obtain $E$, we can test this essential matrix against
other unused feature correspondences to select the correct one.
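
The seven-point procedure (7.19-7.20) can be sketched as follows, assuming NumPy. Rather than
expanding the determinant symbolically, this sketch recovers the cubic's coefficients by
evaluating the determinant at four values of $\alpha$ and fitting a degree-three polynomial,
which is an implementation choice of this example and not something the text prescribes. The
names `essential_seven_point`, `skew`, and `held_out_residual` and the synthetic scene are
likewise assumptions.

```python
import numpy as np

def essential_seven_point(x0, x1):
    """Candidate essential matrices from exactly seven ray directions, each (7, 3)."""
    A = np.einsum('ni,nj->nij', x1, x0).reshape(7, 9)     # rows z_i of (7.14)
    _, _, Vt = np.linalg.svd(A)                           # Vt is 9 x 9
    E0, E1 = Vt[-1].reshape(3, 3), Vt[-2].reshape(3, 3)   # basis of the null space
    # det(a E0 + (1 - a) E1) is a cubic in a (7.20); four samples determine it.
    samples = np.array([-1.0, 0.0, 1.0, 2.0])
    dets = [np.linalg.det(a * E0 + (1 - a) * E1) for a in samples]
    roots = np.roots(np.polyfit(samples, dets, 3))
    alphas = roots[np.abs(roots.imag) < 1e-8].real        # keep the real roots
    return [a * E0 + (1 - a) * E1 for a in alphas]

def skew(t):
    return np.array([[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]])

# Synthetic noise-free scene: seven points for fitting, three held out for selection.
rng = np.random.default_rng(0)
p0 = rng.uniform([-1, -1, 4], [1, 1, 8], size=(10, 3))
a = 0.1
R = np.array([[np.cos(a), 0, np.sin(a)], [0, 1, 0], [-np.sin(a), 0, np.cos(a)]])
t = np.array([1.0, 0.2, 0.1])
p1 = p0 @ R.T + t
x0, x1 = p0 / p0[:, 2:], p1 / p1[:, 2:]

candidates = essential_seven_point(x0[:7], x1[:7])

def held_out_residual(E):
    E = E / np.linalg.norm(E)
    return np.abs(np.einsum('ni,ij,nj->n', x1[7:], E, x0[7:])).max()

E = min(candidates, key=held_out_residual)   # select using the unused correspondences
```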
Once $\hat{t}$ has been recovered, how can we estimate the corresponding rotation matrix $R$?
Recall that the cross-product operator $[\hat{t}]_\times$ (2.32) projects a vector onto a set of
orthogonal basis vectors that include $\hat{t}$, zeros out the $\hat{t}$ component, and rotates the
other two by 90°,

$$[\hat{t}]_\times = S Z R_{90^\circ} S^T = \begin{bmatrix} s_0 & s_1 & \hat{t} \end{bmatrix} \begin{bmatrix} 1 & & \\ & 1 & \\ & & 0 \end{bmatrix} \begin{bmatrix} 0 & -1 & \\ 1 & 0 & \\ & & 1 \end{bmatrix} \begin{bmatrix} s_0^T \\ s_1^T \\ \hat{t}^T \end{bmatrix}, \tag{7.21}$$

where $\hat{t} = s_0 \times s_1$. From Equations (7.18 and 7.21), we get

$$E = [\hat{t}]_\times R = S Z R_{90^\circ} S^T R = U \Sigma V^T, \tag{7.22}$$

from which we can conclude that $S = U$. Recall that for a noise-free essential matrix,
($\Sigma = Z$), and hence

$$R_{90^\circ} U^T R = V^T \tag{7.23}$$

and

$$R = U R_{90^\circ}^T V^T. \tag{7.24}$$

Unfortunately, we only know both $E$ and $\hat{t}$ up to a sign. Furthermore, the matrices $U$
and $V$ are not guaranteed to be rotations (you can flip both their signs and still get a valid
SVD). For this reason, we have to generate all four possible rotation matrices

$$R = \pm U R_{\pm 90^\circ}^T V^T \tag{7.25}$$

and keep the two whose determinant $|R| = 1$. To disambiguate between the remaining pair
of potential rotations, which form a twisted pair (Hartley and Zisserman 2004, p. 240), we
need to pair them with both possible signs of the translation direction $\pm \hat{t}$ and select the
combination for which the largest number of points is seen in front of both cameras. 4
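
The candidate generation in (7.25) and the in-front-of-both-cameras test can be sketched as
follows, assuming NumPy. The helper names (`pose_candidates`, `ray_depths`, `select_pose`,
`skew`) are hypothetical, and the depth test simply solves $d_1 \hat{x}_1 = R (d_0 \hat{x}_0) + \hat{t}$
for the two depths of each correspondence in the least-squares sense.

```python
import numpy as np

R90 = np.array([[0.0, -1.0, 0.0],
                [1.0,  0.0, 0.0],
                [0.0,  0.0, 1.0]])

def pose_candidates(E):
    """The four (R, t_hat) pairs consistent with E = [t]_x R, with det(R) = +1."""
    U, _, Vt = np.linalg.svd(E)
    t_hat = U[:, 2]                       # left singular vector with singular value 0
    out = []
    for W in (R90, R90.T):                # rotations by +90 and -90 degrees
        R = U @ W.T @ Vt
        if np.linalg.det(R) < 0:          # flip the sign to obtain a proper rotation
            R = -R
        out += [(R, t_hat), (R, -t_hat)]
    return out

def ray_depths(R, t, x0, x1):
    """Depths (d0, d1) satisfying d1 x1 = R (d0 x0) + t in the least-squares sense."""
    A = np.column_stack([R @ x0, -x1])
    d, *_ = np.linalg.lstsq(A, -t, rcond=None)
    return d

def select_pose(E, x0, x1):
    def points_in_front(Rt):
        return sum(np.all(ray_depths(*Rt, a, b) > 0) for a, b in zip(x0, x1))
    return max(pose_candidates(E), key=points_in_front)

def skew(t):
    return np.array([[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]])

# Synthetic noise-free check with a known motion.
rng = np.random.default_rng(0)
p0 = rng.uniform([-1, -1, 4], [1, 1, 8], size=(12, 3))
a = 0.1
R_true = np.array([[np.cos(a), 0, np.sin(a)], [0, 1, 0], [-np.sin(a), 0, np.cos(a)]])
t_true = np.array([1.0, 0.2, 0.1])
p1 = p0 @ R_true.T + t_true
x0, x1 = p0 / p0[:, 2:], p1 / p1[:, 2:]

R_est, t_est = select_pose(skew(t_true) @ R_true, x0, x1)
```

Only the direction of the translation is recovered, so `t_est` should be compared with
`t_true` after scaling it to unit length.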

The property that points must lie in front of the camera, i.e., at a positive distance along 
the viewing rays emanating from the camera, is known as chirality (Hartley 1998). In addition 
to determining the signs of the rotation and translation, as described above, the chirality (sign 
of the distances) of the points in a reconstruction can be used inside a RANSAC procedure 
(along with the reprojection errors) to distinguish between likely and unlikely configurations. 5 
Chirality can also be used to transform projective reconstructions (Sections 7.2.1 and 7.2.2) 
into quasi-affine reconstructions (Hartley 1998). 

The normalized “eight-point algorithm” (Hartley 1997a) described above is not the only
way to estimate the camera motion from correspondences. Variants include using seven points
while enforcing the rank two constraint in $E$ (7.19-7.20) and a five-point algorithm that
requires finding the roots of a 10th degree polynomial (Nistér 2004). Since such algorithms
use fewer points to compute their estimates, they are less sensitive to outliers when used as
part of a random sampling (RANSAC) strategy.

4 In the noise-free case, a single point suffices. It is safer, however, to test all or a sufficient
subset of points, downweighting the ones that lie close to the plane at infinity, for which it is
easy to get depth reversals.

5 Note that as points get further away from a camera, i.e., closer toward the plane at infinity,
errors in chirality become more likely.

Figure 7.4 Pure translational camera motion results in visual motion where all the points
move towards (or away from) a common focus of expansion (FOE) $e$. They therefore satisfy
the triple product condition $(x_0, x_1, e) = e \cdot (x_0 \times x_1) = 0$.

Pure translation (known rotation) 

In the case where we know the rotation, we can pre-rotate the points in the second image to
match the viewing direction of the first. The resulting set of 3D points all move towards (or
away from) the focus of expansion (FOE), as shown in Figure 7.4. 6 The resulting essential
matrix $E$ is (in the noise-free case) skew symmetric and so can be estimated more directly by
setting $e_{ij} = -e_{ji}$ and $e_{ii} = 0$ in (7.13). Two points with non-zero parallax now
suffice to estimate the FOE.

A more direct derivation of the FOE estimate can be obtained by minimizing the triple
product

$$\sum_i (x_{i0}, x_{i1}, e)^2 = \sum_i ((x_{i0} \times x_{i1}) \cdot e)^2, \tag{7.26}$$

which is equivalent to finding the null space for the set of equations

$$(y_{i0} - y_{i1}) e_0 + (x_{i1} - x_{i0}) e_1 + (x_{i0} y_{i1} - y_{i0} x_{i1}) e_2 = 0. \tag{7.27}$$

Note that, as in the eight-point algorithm, it is advisable to normalize the 2D points to have
unit variance before computing this estimate.
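
Since the coefficients in (7.27) are just the components of $x_{i0} \times x_{i1}$, the FOE
estimate reduces to a small null-space problem. The sketch below assumes NumPy, uses a
hypothetical function name `estimate_foe`, and runs on noise-free synthetic data without the
recommended normalization.

```python
import numpy as np

def estimate_foe(x0, x1):
    """Focus of expansion from (N, 3) homogeneous points (x, y, 1), rotation already removed."""
    A = np.cross(x0, x1)            # row i holds the coefficients of (7.27)
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1]                   # null vector e, defined up to scale and sign

# Synthetic pure translation: the FOE is the image of the translation direction.
rng = np.random.default_rng(0)
p0 = rng.uniform([-1, -1, 4], [1, 1, 8], size=(10, 3))
t = np.array([0.3, -0.1, 1.0])
p1 = p0 + t
x0, x1 = p0 / p0[:, 2:], p1 / p1[:, 2:]

e = estimate_foe(x0, x1)
e = e / e[2]                        # express the FOE as an image point (x, y, 1)
```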

In situations where a large number of points at infinity are available, e.g., when shooting
outdoor scenes or when the camera motion is small compared to distant objects, this suggests
an alternative RANSAC strategy for estimating the camera motion. First, pick a pair of
points to estimate a rotation, hoping that both of the points lie at infinity (very far from the
camera). Then, compute the FOE and check whether the residual error is small (indicating
agreement with this rotation hypothesis) and whether the motions towards or away from the
epipole (FOE) are all in the same direction (ignoring very small motions, which may be
noise-contaminated).

6 Fans of Star Trek and Star Wars will recognize this as the “jump to hyperdrive” visual effect.

Pure rotation 

The case of pure rotation results in a degenerate estimate of the essential matrix E and of 
the translation direction t. Consider first the case of the rotation matrix being known. The 
estimates for the FOE will be degenerate, since $x_{i0} \approx x_{i1}$, and hence (7.27), is degenerate.
A similar argument shows that the equations for the essential matrix (7.13) are also rank- 
deficient. 

This suggests that it might be prudent before computing a full essential matrix to first 
compute a rotation estimate R using (6.32), potentially with just a small number of points, 
and then compute the residuals after rotating the points before proceeding with a full E 
computation. 


7.2.1 Projective (uncalibrated) reconstruction 

In many cases, such as when trying to build a 3D model from Internet or legacy photos taken 
by unknown cameras without any EXIF tags, we do not know ahead of time the intrinsic 
calibration parameters associated with the input images. In such situations, we can still esti- 
mate a two-frame reconstruction, although the true metric structure may not be available, e.g., 
orthogonal lines or planes in the world may not end up being reconstructed as orthogonal. 

Consider the derivations we used to estimate the essential matrix $E$ (7.10-7.12). In the
uncalibrated case, we do not know the calibration matrices $K_j$, so we cannot use the
normalized ray directions $\hat{x}_j = K_j^{-1} x_j$. Instead, we have access only to the image
coordinates $x_j$, and so the epipolar constraint (7.10) becomes

$$\hat{x}_1^T E \hat{x}_0 = x_1^T K_1^{-T} E K_0^{-1} x_0 = x_1^T F x_0 = 0, \tag{7.28}$$

where

$$F = K_1^{-T} E K_0^{-1} = [e]_\times \tilde{H} \tag{7.29}$$

is called the fundamental matrix (Faugeras 1992; Hartley, Gupta, and Chang 1992; Hartley
and Zisserman 2004).

Like the essential matrix, the fundamental matrix is (in principle) rank two,

$$F = U \Sigma V^T = \begin{bmatrix} u_0 & u_1 & e_1 \end{bmatrix} \begin{bmatrix} \sigma_0 & & \\ & \sigma_1 & \\ & & 0 \end{bmatrix} \begin{bmatrix} v_0^T \\ v_1^T \\ e_0^T \end{bmatrix}. \tag{7.30}$$

Its smallest left singular vector indicates the epipole $e_1$ in image 1 and its smallest right
singular vector is $e_0$ (Figure 7.3). The homography $\tilde{H}$ in (7.29), which in principle
should equal

$$\tilde{H} = K_1^{-T} R K_0^{-1}, \tag{7.31}$$

cannot be uniquely recovered from $F$, since any homography of the form
$\tilde{H}' = \tilde{H} + e v^T$ results in the same $F$ matrix. (Note that $[e]_\times$
annihilates any multiple of $e$.)

Any one of these valid homographies $\tilde{H}$ maps some plane in the scene from one image
to the other. It is not possible to tell in advance which one it is without either selecting four
or more co-planar correspondences to compute $\tilde{H}$ as part of the $F$ estimation process (in
a manner analogous to guessing a rotation for $E$) or mapping all points in one image through
$\tilde{H}$ and seeing which ones line up with their corresponding locations in the other. 7

In order to create a projective reconstruction of the scene, we can pick any valid
homography $\tilde{H}$ that satisfies Equation (7.29). For example, following a technique
analogous to Equations (7.18-7.24), we get

$$F = [e]_\times \tilde{H} = S Z R_{90^\circ} S^T \tilde{H} = U \Sigma V^T \tag{7.32}$$

and hence

$$\tilde{H} = U R_{90^\circ}^T \hat{\Sigma} V^T, \tag{7.33}$$

where $\hat{\Sigma}$ is the singular value matrix with the smallest value replaced by a reasonable
alternative (say, the middle value). 8 We can then form a pair of camera matrices

$$P_0 = [I | 0] \quad \text{and} \quad P_1 = [\tilde{H} | e], \tag{7.34}$$

from which a projective reconstruction of the scene can be computed using triangulation
(Section 7.1).
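
Equations (7.32-7.34) can be sketched as follows, assuming NumPy. The function name
`projective_cameras`, the helper `skew`, and the synthetic calibration matrices are assumptions
for illustration. Because $F$ is only defined up to scale, and the sign of $[e]_\times$ depends
on the handedness of $U$, the check at the end compares $[e]_\times \tilde{H}$ with $F$ only up
to sign after normalization.

```python
import numpy as np

R90 = np.array([[0.0, -1.0, 0.0],
                [1.0,  0.0, 0.0],
                [0.0,  0.0, 1.0]])

def projective_cameras(F):
    """A projective camera pair P0 = [I | 0], P1 = [H | e] compatible with F."""
    U, s, Vt = np.linalg.svd(F)
    e = U[:, 2]                                   # epipole in image 1 (left null vector)
    sigma_hat = np.diag([s[0], s[1], s[1]])       # smallest value replaced by the middle one
    H = U @ R90.T @ sigma_hat @ Vt                # Equation (7.33)
    P0 = np.hstack([np.eye(3), np.zeros((3, 1))])
    P1 = np.hstack([H, e[:, None]])
    return P0, P1

def skew(t):
    return np.array([[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]])

# Synthetic fundamental matrix built from hypothetical calibrations and a known motion.
K0 = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
K1 = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
a = 0.1
R = np.array([[np.cos(a), 0, np.sin(a)], [0, 1, 0], [-np.sin(a), 0, np.cos(a)]])
t = np.array([1.0, 0.2, 0.1])
F = np.linalg.inv(K1).T @ skew(t) @ R @ np.linalg.inv(K0)   # Equation (7.29)

P0, P1 = projective_cameras(F)
H, e = P1[:, :3], P1[:, 3]
F_back = skew(e) @ H                              # should reproduce F up to scale and sign
```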

While the projective reconstruction may not be useful in practice, it can often be upgraded 
to an affine or metric reconstruction, as detailed below. Even without this step, however, 
the fundamental matrix F can be very useful in finding additional correspondences, as they 
must all lie on corresponding epipolar lines, i.e., any feature $x_0$ in image 0 must have its
correspondence lying on the associated epipolar line $l_1 = F x_0$ in image 1, assuming that the
point motions are due to a rigid transformation. 

7 This process is sometimes referred to as plane plus parallax (Section 2.1.5) (Kumar, Anandan,
and Hanna 1994; Sawhney 1994).

8 Hartley and Zisserman (2004, p. 237) recommend using $\tilde{H} = [e]_\times F$ (Luong and
Viéville 1996), which places the camera on the plane at infinity.

7.2.2 Self-calibration

The results of structure from motion computation are much more useful (and intelligible) if 
a metric reconstruction is obtained, i.e., one in which parallel lines are parallel, orthogonal 
walls are at right angles, and the reconstructed model is a scaled version of reality. Over 
the years, a large number of self-calibration (or auto-calibration) techniques have been
developed for converting a projective reconstruction into a metric one, which is equivalent to
recovering the unknown calibration matrices Kj associated with each image (Hartley and 
Zisserman 2004; Moons, Van Gool, and Vergauwen 2010). 

In situations where certain additional information is known about the scene, different 
methods may be employed. For example, if there are parallel lines in the scene (usually, 
having several lines converge on the same vanishing point is good evidence), three or more 
vanishing points, which are the images of points at infinity, can be used to establish the ho- 
mography for the plane at infinity, from which focal lengths and rotations can be recovered. 
If two or more finite orthogonal vanishing points have been observed, the single-image cali- 
bration method based on vanishing points (Section 6.3.2) can be used instead. 

In the absence of such external information, it is not possible to recover a fully
parameterized independent calibration matrix $K_j$ for each image from correspondences alone.
To see this, consider the set of all camera matrices $P_j = K_j [R_j | t_j]$ projecting world
coordinates $p_i = (X_i, Y_i, Z_i, W_i)$ into screen coordinates $x_{ij} \sim P_j p_i$. Now
consider transforming the 3D scene $\{p_i\}$ through an arbitrary $4 \times 4$ projective
transformation $\tilde{H}$, yielding a new model consisting of points $p_i' = \tilde{H} p_i$.
Post-multiplying each $P_j$ matrix by $\tilde{H}^{-1}$ still produces the same screen
coordinates, and a new set of calibration matrices can be computed by applying RQ
decomposition to the new camera matrix $P_j' = P_j \tilde{H}^{-1}$.

For this reason, all self-calibration methods assume some restricted form of the calibration 
matrix, either by setting or equating some of their elements or by assuming that they do not 
vary over time. While most of the techniques discussed by Hartley and Zisserman (2004) and
Moons, Van Gool, and Vergauwen (2010) require three or more frames, in this section we
present a simple technique that can recover the focal lengths $(f_0, f_1)$ of both images from the
fundamental matrix $F$ in a two-frame reconstruction (Hartley and Zisserman 2004, p. 456).

To accomplish this, we assume that the camera has zero skew, a known aspect ratio (usu- 
ally set to 1), and a known optical center, as in Equation (2.59). How reasonable is this 
assumption in practice? The answer, as with many questions, is “it depends”. 

If absolute metric accuracy is required, as in photogrammetry applications, it is imperative 
to pre-calibrate the cameras using one of the techniques from Section 6.3 and to use ground 
control points to pin down the reconstruction. If instead, we simply wish to reconstruct the 
world for visualization or image-based rendering applications, as in the Photo Tourism system 
of Snavely, Seitz, and Szeliski (2006), this assumption is quite reasonable in practice.
Most cameras today have square pixels and an optical center near the middle of the image,
and are much more likely to deviate from a simple camera model due to radial distortion 
(Section 6.3.5), which should be compensated for whenever possible. The biggest problems 
occur when images have been cropped off-center, in which case the optical center will no 
longer be in the middle, or when perspective pictures have been taken of a different picture, 
in which case a general camera matrix becomes necessary. 9 

Given these caveats, the two-frame focal length estimation algorithm based on the Kruppa
equations developed by Hartley and Zisserman (2004, p. 456) proceeds as follows. Take the
left and right singular vectors $\{u_0, u_1, v_0, v_1\}$ of the fundamental matrix $F$ (7.30) and
their associated singular values $\{\sigma_0, \sigma_1\}$ and form the following set of equations:

$$\frac{u_1^T D_0 u_1}{\sigma_0^2 v_0^T D_1 v_0} = -\frac{u_0^T D_0 u_1}{\sigma_0 \sigma_1 v_0^T D_1 v_1} = \frac{u_0^T D_0 u_0}{\sigma_1^2 v_1^T D_1 v_1}, \tag{7.35}$$

where the two matrices

$$D_j = K_j K_j^T = \mathrm{diag}(f_j^2, f_j^2, 1) = \begin{bmatrix} f_j^2 & & \\ & f_j^2 & \\ & & 1 \end{bmatrix} \tag{7.36}$$

encode the unknown focal lengths. For simplicity, let us rewrite each of the numerators and
denominators in (7.35) as

$$e_{ij0}(f_0^2) = u_i^T D_0 u_j = a_{ij} + b_{ij} f_0^2, \tag{7.37}$$

$$e_{ij1}(f_1^2) = \sigma_i \sigma_j v_i^T D_1 v_j = c_{ij} + d_{ij} f_1^2. \tag{7.38}$$

Notice that each of these is affine (linear plus constant) in either $f_0^2$ or $f_1^2$. Hence, we
can cross-multiply these equations to obtain quadratic equations in $f_j^2$, which can readily
be solved. (See also the work by Bougnoux (1998) for some alternative formulations.)

An alternative solution technique is to observe that we have a set of three equations related
by an unknown scalar $\lambda$, i.e.,

$$e_{ij0}(f_0^2) = \lambda e_{ij1}(f_1^2) \tag{7.39}$$

(Richard Hartley, personal communication, July 2009). These can readily be solved to yield
$(f_0^2, \lambda f_1^2, \lambda)$ and hence $(f_0, f_1)$.

How well does this approach work in practice? There are certain degenerate configurations,
such as when there is no rotation or when the optical axes intersect, when it does not
work at all. (In such a situation, you can vary the focal lengths of the cameras and obtain
a deeper or shallower reconstruction, which is an example of a bas-relief ambiguity
(Section 7.4.3).) Hartley and Zisserman (2004) recommend using techniques based on three or
more frames. However, if you find two images for which the estimates of
$(f_0^2, \lambda f_1^2, \lambda)$ are well conditioned, they can be used to initialize a more
complete bundle adjustment of all the parameters (Section 7.4). An alternative, which is often
used in systems such as Photo Tourism, is to use camera EXIF tags or generic default values to
initialize focal length estimates and refine them as part of bundle adjustment.

9 In Photo Tourism, our system registered photographs of an information sign outside Notre
Dame with real pictures of the cathedral.

7.2.3 Application: View morphing

An interesting application of basic two-frame structure from motion is view morphing (also 
known as view interpolation, see Section 13.1), which can be used to generate a smooth 3D 
animation from one view of a 3D scene to another (Chen and Williams 1993; Seitz and Dyer 
1996). 

To create such a transition, you must first smoothly interpolate the camera matrices, i.e., 
the camera positions, orientations, and focal lengths. While simple linear interpolation can be 
used (representing rotations as quaternions (Section 2.1.4)), a more pleasing effect is obtained 
by easing in and easing out the camera parameters, e.g., using a raised cosine, as well as 
moving the camera along a more circular trajectory (Snavely, Seitz, and Szeliski 2006). 
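
The easing idea can be sketched as follows, assuming NumPy. The raised cosine maps a uniform
time parameter to a blend weight with zero slope at both ends. The rotation is interpolated
here with spherical linear interpolation (slerp) of unit quaternions, which is a common choice
but an assumption of this sketch; the function names and the (w, x, y, z) quaternion ordering
are also illustrative, and the circular camera trajectory is not modeled.

```python
import numpy as np

def ease(u):
    """Raised-cosine ease-in/ease-out: maps [0, 1] onto [0, 1] with zero slope at the ends."""
    return 0.5 - 0.5 * np.cos(np.pi * u)

def slerp(q0, q1, s):
    """Spherical linear interpolation between unit quaternions q0 and q1."""
    q0, q1 = q0 / np.linalg.norm(q0), q1 / np.linalg.norm(q1)
    dot = np.dot(q0, q1)
    if dot < 0.0:                       # take the shorter of the two arcs
        q1, dot = -q1, -dot
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    if theta < 1e-8:                    # orientations are essentially identical
        return q0
    return (np.sin((1 - s) * theta) * q0 + np.sin(s * theta) * q1) / np.sin(theta)

def interpolate_camera(c0, q0, f0, c1, q1, f1, u):
    """Blend camera center, orientation, and focal length at time u in [0, 1]."""
    s = ease(u)
    return (1 - s) * c0 + s * c1, slerp(q0, q1, s), (1 - s) * f0 + s * f1

# Illustrative cameras: identity orientation to a 90 degree rotation about the z axis.
q_a = np.array([1.0, 0.0, 0.0, 0.0])
q_b = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
c_mid, q_mid, f_mid = interpolate_camera(np.zeros(3), q_a, 500.0,
                                         np.array([2.0, 0.0, 0.0]), q_b, 700.0, 0.5)
```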

To generate in-between frames, either a full set of 3D correspondences needs to be es- 
tablished (Section 11.3) or 3D models (proxies) must be created for each reference view. 
Section 13.1 describes several widely used approaches to this problem. One of the simplest 
is to just triangulate the set of matched feature points in each image, e.g., using Delaunay 
triangulation. As the 3D points are re-projected into their intermediate views, pixels can be 
mapped from their original source images to their new views using affine or projective map- 
ping (Szeliski and Shum 1997). The final image is then composited using a linear blend of 
the two reference images, as with usual morphing (Section 3.6.3). 

7.3 Factorization 

When processing video sequences, we often get extended feature tracks (Section 4.1.4) from 
which it is possible to recover the structure and motion using a process called factorization. 
Consider the tracks generated by a rotating ping pong ball, which has been marked with 
dots to make its shape and motion more discernable (Figure 7.5). We can readily see from 
the shape of the tracks that the moving object must be a sphere, but how can we infer this 
mathematically? 

Figure 7.5 3D reconstruction of a rotating ping pong ball using factorization (Tomasi and 
Kanade 1992) © 1992 Springer: (a) sample image with tracked features overlaid; (b) sub-sampled 
feature motion stream; (c) two views of the reconstructed 3D model. 

It turns out that, under orthography or related models we discuss below, the shape and 
motion can be recovered simultaneously using a singular value decomposition (Tomasi and 
Kanade 1992). Consider the orthographic and weak perspective projection models introduced 
in Equations (2.47-2.49). Since the last row is always [0 0 0 1], there is no perspective division 
and we can write 

$$x_{ji} = \tilde{P}_j \bar{p}_i, \tag{7.40}$$

where $x_{ji}$ is the location of the $i$th point in the $j$th frame, $\tilde{P}_j$ is the upper $2 \times 4$ portion of 
the projection matrix $P_j$, and $\bar{p}_i = (X_i, Y_i, Z_i, 1)$ is the augmented 3D point position. 10 

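Let us assume (for now) that every point $i$ is visible in every frame $j$. We can take the 
centroid (average) of the projected point locations $x_{ji}$ in frame $j$, 

$$\bar{x}_j = \frac{1}{N} \sum_i x_{ji} = \tilde{P}_j \frac{1}{N} \sum_i \bar{p}_i = \tilde{P}_j \bar{c}, \tag{7.41}$$

where $\bar{c} = (\bar{X}, \bar{Y}, \bar{Z}, 1)$ is the augmented 3D centroid of the point cloud. 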

Since world coordinate frames in structure from motion are always arbitrary, i.e., we 
cannot recover true 3D locations without ground control points (known measurements), we 
can place the origin of the world at the centroid of the points, i.e., $\bar{X} = \bar{Y} = \bar{Z} = 0$, so that 
$\bar{c} = (0, 0, 0, 1)$. We see from this that the centroid of the 2D points in each frame, $\bar{x}_j$, directly 
gives us the last column of $\tilde{P}_j$. 

Let $\tilde{x}_{ji} = x_{ji} - \bar{x}_j$ be the 2D point locations after their image centroid has been 
subtracted. We can now write 

$$\tilde{x}_{ji} = M_j p_i, \tag{7.42}$$

where $M_j$ is the upper $2 \times 3$ portion of the projection matrix $P_j$ and $p_i = (X_i, Y_i, Z_i)$. We 
can concatenate all of these measurement equations into one large matrix 

$$\hat{X} =
\begin{bmatrix}
\tilde{x}_{11} & \cdots & \tilde{x}_{1i} & \cdots & \tilde{x}_{1N} \\
\vdots & & \vdots & & \vdots \\
\tilde{x}_{j1} & \cdots & \tilde{x}_{ji} & \cdots & \tilde{x}_{jN} \\
\vdots & & \vdots & & \vdots \\
\tilde{x}_{M1} & \cdots & \tilde{x}_{Mi} & \cdots & \tilde{x}_{MN}
\end{bmatrix}
=
\begin{bmatrix} M_1 \\ \vdots \\ M_j \\ \vdots \\ M_M \end{bmatrix}
\begin{bmatrix} p_1 & \cdots & p_i & \cdots & p_N \end{bmatrix}
= \hat{M} \hat{S}. \tag{7.43}$$

$\hat{X}$ is called the measurement matrix and $\hat{M}$ and $\hat{S}$ are the motion and structure matrices, 
respectively (Tomasi and Kanade 1992). 

10 In this section, we index the 2D point positions as $x_{ji}$ instead of $x_{ij}$, since this is the convention adopted by 
factorization papers (Tomasi and Kanade 1992) and is consistent with the factorization given in (7.43). 

Because the motion matrix $\hat{M}$ is $2M \times 3$ and the structure matrix $\hat{S}$ is $3 \times N$, an SVD 
applied to $\hat{X}$ has at most three non-zero singular values. In the case where the measurements in 
$\hat{X}$ are noisy, SVD returns the rank-three factorization of $\hat{X}$ that is the closest to $\hat{X}$ in a least 
squares sense (Tomasi and Kanade 1992; Golub and Van Loan 1996; Hartley and Zisserman 
2004). 

It would be nice if the SVD of $\hat{X} = U \Sigma V^T$ directly returned the matrices $\hat{M}$ and $\hat{S}$, 
but it does not. Instead, we can write the relationship 

$$\hat{X} = U \Sigma V^T = [U Q][Q^{-1} \Sigma V^T] \tag{7.44}$$

and set $\hat{M} = U Q$ and $\hat{S} = Q^{-1} \Sigma V^T$. 11 
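
The centroid subtraction and rank-three SVD described above can be sketched on synthetic data. This is an illustrative example only: it assumes NumPy, and the random orthographic cameras, the point cloud, the seed, and the particular matrix `Q` are all made up for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_frames, num_points = 6, 20

# Synthetic ground truth (assumed setup): random 3D points, orthographic cameras.
points = rng.normal(size=(3, num_points))
tracks = np.zeros((2 * num_frames, num_points))
for j in range(num_frames):
    R, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # orthonormal rows (may include a reflection)
    t = rng.normal(size=(2, 1))                    # 2D image translation
    tracks[2 * j:2 * j + 2] = R[:2] @ points + t

# Subtract the per-frame image centroid, as in Equations (7.41)-(7.42).
X_hat = tracks - tracks.mean(axis=1, keepdims=True)

# SVD of the measurement matrix; only three singular values are significant.
U, sigma, Vt = np.linalg.svd(X_hat, full_matrices=False)
U3, S3, Vt3 = U[:, :3], np.diag(sigma[:3]), Vt[:3]

# Any invertible Q gives a valid factorization X_hat = (U Q)(Q^-1 Sigma V^T).
Q = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 3.0]])
M_hat = U3 @ Q
S_hat = np.linalg.inv(Q) @ S3 @ Vt3
```

The arbitrary choice of `Q` here reproduces the measurements exactly, which is why additional constraints on the motion matrix are needed to pin it down.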

How can we recover the values of the $3 \times 3$ matrix $Q$? This depends on the motion model 
being used. In the case of orthographic projection (2.47), the entries in $M_j$ are the first two 
rows of rotation matrices $R_j$, so we have 

$$\begin{aligned}
m_{j0} \cdot m_{j0} &= u_{2j} Q Q^T u_{2j}^T = 1, \\
m_{j0} \cdot m_{j1} &= u_{2j} Q Q^T u_{2j+1}^T = 0, \\
m_{j1} \cdot m_{j1} &= u_{2j+1} Q Q^T u_{2j+1}^T = 1,
\end{aligned} \tag{7.45}$$

where $m_{j0}$ and $m_{j1}$ are the two rows of $M_j$ and $u_k$ are the $1 \times 3$ rows of the matrix $U$. This gives us a 
large set of equations for the entries in the matrix $Q Q^T$, from which the matrix $Q$ can be recovered 
using a matrix square root (Appendix A.1.4). If we have scaled orthography (2.48), i.e., $M_j = s_j R_j$, 
the first and third equations are both equal to $s_j^2$ and can be set equal to each other. 
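
One way to solve the orthonormality constraints (7.45) is to treat the six unique entries of the symmetric matrix $Q Q^T$ as unknowns in a linear least squares problem and then take a matrix square root. The sketch below assumes NumPy and noise-free synthetic orthographic data; the function name `metric_upgrade` and the use of an eigendecomposition for the square root are illustrative choices, not something prescribed by the text.

```python
import numpy as np

def metric_upgrade(U3):
    """Solve the constraints (7.45) for L = Q Q^T and return one valid Q."""
    def coeffs(a, b):
        # Coefficients of a L b^T in the 6 unique entries of the symmetric matrix L.
        return np.array([a[0] * b[0], a[0] * b[1] + a[1] * b[0], a[0] * b[2] + a[2] * b[0],
                         a[1] * b[1], a[1] * b[2] + a[2] * b[1], a[2] * b[2]])
    rows, rhs = [], []
    for j in range(U3.shape[0] // 2):
        u0, u1 = U3[2 * j], U3[2 * j + 1]
        rows += [coeffs(u0, u0), coeffs(u0, u1), coeffs(u1, u1)]
        rhs += [1.0, 0.0, 1.0]
    l = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)[0]
    L = np.array([[l[0], l[1], l[2]], [l[1], l[3], l[4]], [l[2], l[4], l[5]]])
    # Matrix square root via eigendecomposition (L should be positive definite).
    w, V = np.linalg.eigh(L)
    return V @ np.diag(np.sqrt(np.clip(w, 1e-12, None)))

# Synthetic, noise-free orthographic data (assumed setup for illustration).
rng = np.random.default_rng(1)
points = rng.normal(size=(3, 30))
points -= points.mean(axis=1, keepdims=True)
M_true = np.vstack([np.linalg.qr(rng.normal(size=(3, 3)))[0][:2] for _ in range(8)])
X_hat = M_true @ points

U, sigma, Vt = np.linalg.svd(X_hat, full_matrices=False)
Q = metric_upgrade(U[:, :3])
M_hat = U[:, :3] @ Q
S_hat = np.linalg.inv(Q) @ np.diag(sigma[:3]) @ Vt[:3]
```

The recovered `M_hat` has orthonormal row pairs, but it is still only defined up to a global rotation (and the depth reversal discussed next), since `Q` can be post-multiplied by any orthogonal matrix without changing $Q Q^T$.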

Note that even once $Q$ has been recovered, there still exists a bas-relief ambiguity, i.e., 
we can never be sure if the object is rotating left to right or if its depth-reversed version is 
moving the other way. (This can be seen in the classic rotating Necker Cube visual illusion.) 
Additional cues, such as the appearance and disappearance of points, or perspective effects, 
both of which are discussed below, can be used to remove this ambiguity. 

11 Tomasi and Kanade (1992) first take the square root of $\Sigma$ and distribute this to $U$ and $V$, but there is no 
particular reason to do this. 
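
The depth-reversal ambiguity under orthography can be checked numerically. The sketch below assumes NumPy and a made-up point cloud rotating about the vertical axis; it shows that the depth-flipped shape rotating in the opposite direction produces exactly the same images.

```python
import numpy as np

def rotation_y(angle):
    """Rotation by `angle` radians about the y (vertical) axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

rng = np.random.default_rng(2)
points = rng.normal(size=(3, 10))          # arbitrary 3D shape (illustrative)
angles = np.linspace(0.0, 0.5, 5)
D = np.diag([1.0, 1.0, -1.0])              # flips the depth (Z) axis

# Orthographic images keep only the first two rows of each rotation.
images = [rotation_y(a)[:2] @ points for a in angles]
# Depth-reversed shape seen while rotating the opposite way.
images_flipped = [rotation_y(-a)[:2] @ (D @ points) for a in angles]
```

Because the two image sequences are identical, no orthographic factorization can distinguish the two interpretations without extra cues.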

For motion models other than pure orthography, e.g., for scaled orthography or para- 
perspective, the approach above must be extended in the appropriate manner. Such tech- 
niques are relatively straightforward to derive from first principles; more details can be found 
in papers that extend the basic factorization approach to these more flexible models (Poel- 
man and Kanade 1997). Additional extensions of the original factorization algorithm include 
multi-body rigid motion (Costeira and Kanade 1995), sequential updates to the factorization 
(Morita and Kanade 1997), the addition of lines and planes (Morris and Kanade 1998), and 
re-scaling the measurements to incorporate individual location uncertainties (Anandan and 
Irani 2002). 

A disadvantage of factorization approaches is that they require a complete set of tracks, 
i.e., each point must be visible in each frame, in order for the factorization approach to work. 
Tomasi and Kanade (1992) deal with this problem by first applying factorization to smaller 
denser subsets and then using known camera (motion) or point (structure) estimates to hallu- 
cinate additional missing values, which allows them to incrementally incorporate more fea- 
tures and cameras. Huynh, Hartley, and Heyden (2003) extend this approach to view missing 
data as special cases of outliers. Buchanan and Fitzgibbon (2005) develop fast iterative al- 
gorithms for performing large matrix factorizations with missing data. The general topic of 
principal component analysis (PCA) with missing data also appears in other computer vision 
problems (Shum, Ikeuchi, and Reddy 1995; De la Torre and Black 2003; Gross, Matthews, 
and Baker 2006; Torresani, Hertzmann, and Bregler 2008; Vidal, Ma, and Sastry 2010). 

7.3.1 Perspective and projective factorization 

Another disadvantage of regular factorization is that it cannot deal with perspective cameras. 
One way to get around this problem is to perform an initial affine (e.g., orthographic) recon- 
struction and to then correct for the perspective effects in an iterative manner (Christy and 
Horaud 1996). 

Observe that the object-centered projection model (2.76) 

$$x_{ji} = s_j \frac{r_{xj} \cdot p_i + t_{xj}}{1 + \eta_j r_{zj} \cdot p_i}, \tag{7.46}$$

$$y_{ji} = s_j \frac{r_{yj} \cdot p_i + t_{yj}}{1 + \eta_j r_{zj} \cdot p_i} \tag{7.47}$$

differs from the scaled orthographic projection model (7.40) by the inclusion of the denominator 
terms $(1 + \eta_j r_{zj} \cdot p_i)$. 12 

12 Assuming that the optical center $(c_x, c_y)$ lies at $(0, 0)$ and that pixels are square. 

If we knew the correct values of $\eta_j = t_{zj}^{-1}$ and the structure and motion parameters $R_j$ and 
$p_i$, we could cross-multiply the left hand side (visible point measurements $x_{ji}$ and $y_{ji}$) by the 
denominator and get corrected values, for which the bilinear projection model (7.40) is exact. 
In practice, after an initial reconstruction, the values of $\eta_j$ can be estimated independently 
for each frame by comparing reconstructed and sensed point positions. (The third row of the 
rotation matrix $r_{zj}$ is always available as the cross-product of the first two rows.) Note that 
since the $\eta_j$ are determined from the image measurements, the cameras do not have to be 
pre-calibrated, i.e., their focal lengths can be recovered from $f_j = s_j / \eta_j$. 

Once the $\eta_j$ have been estimated, the feature locations can then be corrected before applying 
another round of factorization. Note that because of the initial depth reversal ambiguity, 
both reconstructions have to be tried while calculating $\eta_j$. (The incorrect reconstruction will 
result in a negative $\eta_j$, which is not physically meaningful.) Christy and Horaud (1996) report 
that their algorithm usually converges in three to five iterations, with the majority of the time 
spent in the SVD computation. 
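
The cross-multiplication step can be illustrated for a single frame. The sketch below assumes NumPy and made-up values for the scale, translation, and $\eta_j$; the least squares estimate of $\eta_j$ at the end is one plausible way of "comparing reconstructed and sensed point positions" and is not claimed to be the exact procedure of Christy and Horaud (1996).

```python
import numpy as np

rng = np.random.default_rng(3)
points = rng.normal(scale=0.3, size=(3, 12))     # object-centered 3D points p_i
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))     # rows play the roles of r_x, r_y, r_z
s, eta = 2.0, 0.1                                # scale s_j and eta_j = 1 / t_zj (assumed values)
t_xy = np.array([[0.2], [-0.1]])

z = R[2] @ points                                # r_zj . p_i for every point
denom = 1.0 + eta * z
perspective = s * (R[:2] @ points + t_xy) / denom    # Equations (7.46)-(7.47)
scaled_ortho = s * (R[:2] @ points + t_xy)           # what factorization can model exactly

# Cross-multiplying the measurements by the denominator removes the perspective effect.
corrected = perspective * denom

# Least squares estimate of eta from x_meas * (1 + eta * z) = x_ortho.
a = (perspective * z).ravel()
b = (scaled_ortho - perspective).ravel()
eta_est = float(a @ b / (a @ a))
```

In an actual iteration, `scaled_ortho` and `z` would come from the current reconstruction rather than from ground truth, so `eta_est` would only be approximate and would improve over successive rounds.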

An alternative approach, which does not assume partially calibrated cameras (known op- 
tical center, square pixels, and zero skew) is to perform a fully projective factorization (Sturm 
and Triggs 1996; Triggs 1996). In this case, the inclusion of the third row of the camera 
matrix in (7.40) is equivalent to multiplying each reconstructed measurement $x_{ji} = M_j p_i$ 
by its inverse (projective) depth $\eta_{ji} = d_{ji}^{-1} = 1/(P_{j2} p_i)$ or, equivalently, multiplying each 
measured position by its projective depth $d_{ji}$, 

$$\hat{X} =
\begin{bmatrix}
d_{11} \tilde{x}_{11} & \cdots & d_{1i} \tilde{x}_{1i} & \cdots & d_{1N} \tilde{x}_{1N} \\
\vdots & & \vdots & & \vdots \\
d_{j1} \tilde{x}_{j1} & \cdots & d_{ji} \tilde{x}_{ji} & \cdots & d_{jN} \tilde{x}_{jN} \\
\vdots & & \vdots & & \vdots \\
d_{M1} \tilde{x}_{M1} & \cdots & d_{Mi} \tilde{x}_{Mi} & \cdots & d_{MN} \tilde{x}_{MN}
\end{bmatrix}
= \hat{M} \hat{S}. \tag{7.48}$$

In the original paper by Sturm and Triggs (1996), the projective depths $d_{ji}$ are obtained from 
two-frame reconstructions, while in later work (Triggs 1996; Oliensis and Hartley 2007), they 
are initialized to $d_{ji} = 1$ and updated after each iteration. Oliensis and Hartley (2007) present 
an update formula that is guaranteed to converge to a fixed point. None of these authors 
suggest actually estimating the third row of $P_j$ as part of the projective depth computations. 
In any case, it is unclear when a fully projective reconstruction would be preferable to a 
partially calibrated one, especially if they are being used to initialize a full bundle adjustment 
of all the parameters. 
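
The effect of the projective depths in (7.48) can be checked on synthetic data. In the sketch below, which assumes NumPy and randomly generated $3 \times 4$ cameras (an illustrative setup, not taken from the cited papers), the measurement matrix built from homogeneous image points has rank four once each point is multiplied by its true projective depth, but not otherwise.

```python
import numpy as np

rng = np.random.default_rng(4)
num_frames, num_points = 5, 15
points_h = np.vstack([rng.normal(size=(3, num_points)), np.ones((1, num_points))])

rows_scaled, rows_unscaled = [], []
for j in range(num_frames):
    P = rng.normal(size=(3, 4))                         # random projective camera
    P[2] = np.append(0.1 * rng.normal(size=3), 1.0)     # third row: depths close to 1
    proj = P @ points_h
    d = proj[2]                                         # projective depths d_ji
    x = proj / d                                        # homogeneous image points (x, y, 1)
    rows_scaled.append(d * x)                           # depth-scaled measurements
    rows_unscaled.append(x)

sv_scaled = np.linalg.svd(np.vstack(rows_scaled), compute_uv=False)
sv_unscaled = np.linalg.svd(np.vstack(rows_unscaled), compute_uv=False)
```

Comparing `sv_scaled[4]` with `sv_unscaled[4]` shows why iterative schemes that start from $d_{ji} = 1$ need to update the depths: the unscaled matrix is only approximately rank four.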

One of the attractions of factorization methods is that they provide a “closed form” (sometimes 
called a “linear”) method to initialize iterative techniques such as bundle adjustment. 
An alternative initialization technique is to estimate the homographies corresponding to some 
common plane seen by all the cameras (Rother and Carlsson 2002). In a calibrated camera 
setting, this can correspond to estimating consistent rotations for all of the cameras, for example, 
using matched vanishing points (Antone and Teller 2002). Once these have been 
recovered, the camera positions can then be obtained by solving a linear system (Antone and 
Teller 2002; Rother and Carlsson 2002; Rother 2003). 

Figure 7.6 3D teacup model reconstructed from a 240-frame video sequence (Tomasi and 
Kanade 1992) © 1992 Springer: (a) first frame of video; (b) last frame of video; (c) side 
view of 3D model; (d) top view of 3D model. 


7.3.2 Application: Sparse 3D model extraction 

Once a multi-view 3D reconstruction of the scene has been estimated, it then becomes possi- 
ble to create a texture-mapped 3D model of the object and to look at it from new directions. 

The first step is to create a denser 3D model than the sparse point cloud that structure 
from motion produces. One alternative is to run dense multi-view stereo (Sections 11.3- 
11.6). Alternatively, a simpler technique such as 3D triangulation can be used, as shown in 
Figure 7.6, in which 207 reconstructed 3D points are triangulated to produce a surface mesh. 

Figure 7.7 A set of chained transforms for projecting a 3D point $p_i$ into a 2D measurement 
$x_{ij}$ through a series of transformations $f^{(k)}$, each of which is controlled by its own 
set of parameters. The dashed lines indicate the flow of information as partial derivatives 
are computed during a backward pass. The formula for the radial distortion function is 
$f_{RD}(x) = (1 + \kappa_1 r^2 + \kappa_2 r^4) x$. 

In order to create a more realistic model, a texture map can be extracted for each triangle 
face. The equations to map points on the surface of a 3D triangle to a 2D image are 
straightforward: just pass the local 2D coordinates on the triangle through the $3 \times 4$ camera 
projection matrix to obtain a $3 \times 3$ homography (planar perspective projection). When multiple 
source images are available, as is usually the case in multi-view reconstruction, either 
the closest and most fronto-parallel image can be used or multiple images can be blended in 
to deal with view-dependent foreshortening (Wang, Kang, Szeliski et al. 2001) or to obtain 
super-resolved results (Goldluecke and Cremers 2009). Another alternative is to create a separate 
texture map from each reference camera and to blend between them during rendering, 
which is known as view-dependent texture mapping (Section 13.1.1) (Debevec, Taylor, and 
Malik 1996; Debevec, Yu, and Borshukov 1998). 
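
The triangle-to-image homography used for texture extraction can be written down directly. The sketch below assumes NumPy, a hypothetical pinhole camera, and local triangle coordinates $(a, b)$ defined by $p = v_0 + a (v_1 - v_0) + b (v_2 - v_0)$; the function and variable names are illustrative.

```python
import numpy as np

def triangle_homography(P, v0, v1, v2):
    """3x3 homography mapping local triangle coordinates (a, b, 1) to homogeneous pixels."""
    T = np.zeros((4, 3))          # maps (a, b, 1) to the homogeneous 3D point
    T[:3, 0] = v1 - v0
    T[:3, 1] = v2 - v0
    T[:3, 2] = v0
    T[3, 2] = 1.0
    return P @ T                  # (3x4 camera) times (4x3 plane parameterization)

def project(H, a, b):
    """Apply the homography to local coordinates (a, b) and dehomogenize."""
    x = H @ np.array([a, b, 1.0])
    return x[:2] / x[2]

# Hypothetical camera: focal length 800, principal point (320, 240), 5 units from the origin.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
P = K @ np.hstack([np.eye(3), np.array([[0.0], [0.0], [5.0]])])
v0 = np.array([0.0, 0.0, 0.0])
v1 = np.array([1.0, 0.0, 0.5])
v2 = np.array([0.0, 1.0, -0.5])
H = triangle_homography(P, v0, v1, v2)
```

Inverting `H` gives the mapping from image pixels back to texture coordinates on the triangle, which is what a texture extraction pass would typically sample through.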


7.4 Bundle adjustment 

As we have mentioned several times before, the most accurate way to recover structure and 
motion is to perform robust non-linear minimization of the measurement (re-projection) er- 
rors, which is commonly known in the photogrammetry (and now computer vision) commu- 
nities as bundle adjustment. 13 Triggs, McLauchlan, Hartley et al. (1999) provide an excellent 
overview of this topic, including its historical development, pointers to the photogrammetry 
literature (Slama 1980; Atkinson 1996; Kraus 1997), and subtle issues with gauge ambigu- 
ities. The topic is also treated in depth in textbooks and surveys on multi-view geometry 
(Faugeras and Luong 2001; Hartley and Zisserman 2004; Moons, Van Gool, and Vergauwen 
2010). 

We have already introduced the elements of bundle adjustment in our discussion on iterative 
pose estimation (Section 6.2.2), i.e., Equations (6.42-6.48) and Figure 6.5. The biggest 
difference between these formulas and full bundle adjustment is that our feature location 
measurements now depend not only on the point (track index) $i$ but also on the camera pose 
index $j$, 

$$x_{ij} = f(p_i, R_j, c_j, K_j), \tag{7.49}$$

and that the 3D point positions $p_i$ are also being simultaneously updated. In addition, it is 
common to add a stage for radial distortion parameter estimation (2.78), 

$$f_{RD}(x) = (1 + \kappa_1 r^2 + \kappa_2 r^4) x, \tag{7.50}$$

if the cameras being used have not been pre-calibrated, as shown in Figure 7.7. 
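
As a small illustration of the radial distortion stage (7.50), the following sketch applies the polynomial model to an array of normalized image coordinates. It assumes NumPy; the function name and the convention that points are stored as rows are illustrative choices.

```python
import numpy as np

def radial_distort(x, kappa1, kappa2):
    """Apply f_RD(x) = (1 + kappa1 r^2 + kappa2 r^4) x to an (N, 2) array of points."""
    x = np.asarray(x, dtype=float)
    r2 = np.sum(x * x, axis=-1, keepdims=True)    # squared radius of each point
    return (1.0 + kappa1 * r2 + kappa2 * r2 * r2) * x

distorted = radial_distort([[0.0, 0.0], [1.0, 0.0], [0.5, 0.5]], kappa1=0.1, kappa2=0.01)
```

Points at the image center are unchanged, while points farther out are displaced radially by an amount that grows with $r^2$ and $r^4$.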

13 The term “bundle” refers to the bundles of rays connecting camera centers to 3D points and the term 
“adjustment” refers to the iterative minimization of re-projection error. Alternative terms for this in the vision community 
include optimal motion estimation (Weng, Ahuja, and Huang 1993) and non-linear least squares (Appendix A.3) 
(Taylor, Kriegman, and Anandan 1991; Szeliski and Kang 1994). 

While most of the boxes (transforms) in Figure 7.7 have previously been explained (6.47), 
the leftmost box has not. This box performs a robust comparison of the predicted and measured 
2D locations $\hat{x}_{ij}$ and $\tilde{x}_{ij}$ after re-scaling by the measurement noise covariance $\Sigma_{ij}$. In 
more detail, this operation can be written as 

$$r_{ij} = \tilde{x}_{ij} - \hat{x}_{ij}, \tag{7.51}$$

$$s_{ij}^2 = r_{ij}^T \Sigma_{ij}^{-1} r_{ij}, \tag{7.52}$$

$$e_{ij} = \hat{\rho}(s_{ij}^2), \tag{7.53}$$

where $\hat{\rho}(r^2) = \rho(r)$. The corresponding Jacobians (partial derivatives) can be written as 

$$\frac{\partial e_{ij}}{\partial s_{ij}^2} = \hat{\rho}'(s_{ij}^2), \tag{7.54}$$

$$\frac{\partial s_{ij}^2}{\partial \tilde{x}_{ij}} = 2 \Sigma_{ij}^{-1} r_{ij}. \tag{7.55}$$

The advantage of the chained representation introduced above is that it not only makes 
the computations of the partial derivatives and Jacobians simpler but it can also be adapted 
to any camera configuration. Consider for example a pair of cameras mounted on a robot 
that is moving around in the world, as shown in Figure 7.8a. By replacing the rightmost 
two transformations in Figure 7.7 with the transformations shown in Figure 7.8b, we can 
simultaneously recover the position of the robot at each time and the calibration of each 
camera with respect to the rig, in addition to the 3D structure of the world. 
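
Returning to the robust comparison stage of Equations (7.51-7.53), the computation for a single measurement can be sketched as follows. The example assumes NumPy and uses a Huber penalty for $\rho$; the choice of penalty and the threshold `delta` are assumptions made for illustration, since the text does not fix a particular robust function here.

```python
import numpy as np

def robust_error(x_meas, x_pred, cov, delta=1.0):
    """Return (e, s2) for one 2D measurement, using a Huber penalty on s = sqrt(s2)."""
    r = np.asarray(x_meas, float) - np.asarray(x_pred, float)   # residual, as in (7.51)
    s2 = float(r @ np.linalg.solve(cov, r))                     # covariance-scaled norm, (7.52)
    s = np.sqrt(s2)
    # Huber rho: quadratic for small residuals, linear for large ones, as in (7.53).
    e = 0.5 * s2 if s <= delta else delta * (s - 0.5 * delta)
    return e, s2

cov = np.diag([4.0, 4.0])                                       # 2-pixel standard deviation
e_in, s2_in = robust_error([2.0, 0.0], [0.0, 0.0], cov)         # inlier-sized residual
e_out, s2_out = robust_error([10.0, 0.0], [0.0, 0.0], cov)      # outlier-sized residual
```

The outlier's penalty grows only linearly with its normalized residual, so a few gross mismatches have limited influence on the overall minimization.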

7.4.1 Exploiting sparsity 

Large bundle adjustment problems, such as those involving reconstructing 3D scenes from 
thousands of Internet photographs (Snavely, Seitz, and Szeliski 2008b; Agarwal, Snavely, 
Simon et al. 2009; Agarwal, Furukawa, Snavely et al. 2010; Snavely, Simon, Goesele et al. 
2010), can require solving non-linear least squares problems with millions of measurements 
(feature matches) and tens of thousands of unknown parameters (3D point positions and cam- 
era poses). Unless some care is taken, these kinds of problem can become intractable, since 
the (direct) solution of dense least squares problems is cubic in the number of unknowns. 

Fortunately, structure from motion is a bipartite problem in structure and motion. Each 
feature point $x_{ij}$ in a given image depends on one 3D point position $p_i$ and one 3D camera 
pose $(R_j, c_j)$. This is illustrated in Figure 7.9a, where each circle (1-9) indicates a 3D point, 
each square (A-D) indicates a camera, and lines (edges) indicate which points are visible in 
which cameras (2D features). If the values for all the points are known or fixed, the equations 
for all the cameras become independent, and vice versa. 

Figure 7.8 A camera rig and its associated transform chain. (a) As the mobile rig (robot) 
moves around in the world, its pose with respect to the world at time $t$ is captured by $(R_t^r, c_t^r)$. 
Each camera’s pose with respect to the rig is captured by $(R_j^c, c_j^c)$. (b) A 3D point with world 
coordinates $p_i^w$ is first transformed into rig coordinates $p_i^r$, and then through the rest of the 
camera-specific chain, as shown in Figure 7.7. 

Figure 7.9 (a) Bipartite graph for a toy structure from motion problem and (b) its associated 
Jacobian $J$ and (c) Hessian $A$. Numbers indicate 3D points and letters indicate cameras. The 
dashed arcs and light blue squares indicate the fill-in that occurs when the structure (point) 
variables are eliminated. 

If we order the structure variables before the motion variables in the Hessian matrix $A$ 
(and hence also the right hand side vector $b$), we obtain a structure for the Hessian shown in 
Figure 7.9c. 14 When such a system is solved using sparse Cholesky factorization (see Appendix 
A.4) (Bjorck 1996; Golub and Van Loan 1996), the fill-in occurs in the smaller motion 
Hessian $A_{cc}$ (Szeliski and Kang 1994; Triggs, McLauchlan, Hartley et al. 1999; Hartley and 
Zisserman 2004; Lourakis and Argyros 2009; Engels, Stewenius, and Nister 2006). Some recent 
papers by Byröd and Åström (2009), Jeong, Nister, Steedly et al. (2010), and Agarwal, 
Snavely, Seitz et al. (2010) explore the use of iterative (conjugate gradient) techniques for the 
solution of bundle adjustment problems. 

In more detail, the reduced motion Hessian is computed using the Schur complement, 

$$A'_{cc} = A_{cc} - A_{pc}^T A_{pp}^{-1} A_{pc}, \tag{7.56}$$

where $A_{pp}$ is the point (structure) Hessian (the top left block of Figure 7.9c), $A_{pc}$ is the 
point-camera Hessian (the top right block), and $A_{cc}$ and $A'_{cc}$ are the motion Hessians before 
and after the point variable elimination (the bottom right block of Figure 7.9c). Notice that 
$A'_{cc}$ has a non-zero entry between two cameras if they see any 3D point in common. This is 
indicated with dashed arcs in Figure 7.9a and light blue squares in Figure 7.9c. 
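
The point elimination in (7.56) can be sketched with dense NumPy arrays on a toy problem. The Jacobian below is random and only mimics the bipartite sparsity pattern (three parameters per point, six per camera, both assumed), and a small damping term is added so that the toy system is invertible; a real implementation would use sparse block storage instead.

```python
import numpy as np

rng = np.random.default_rng(5)
num_points, num_cams = 6, 3
n_p, n_c = 3 * num_points, 6 * num_cams

# Points 0-2 are seen by cameras 0 and 1; points 3-5 by cameras 1 and 2.
observations = [(i, j) for i in range(3) for j in (0, 1)]
observations += [(i, j) for i in range(3, 6) for j in (1, 2)]

# Each 2D measurement touches exactly one point block and one camera block.
J = np.zeros((2 * len(observations), n_p + n_c))
for k, (i, j) in enumerate(observations):
    J[2 * k:2 * k + 2, 3 * i:3 * i + 3] = rng.normal(size=(2, 3))
    J[2 * k:2 * k + 2, n_p + 6 * j:n_p + 6 * j + 6] = rng.normal(size=(2, 6))
residual = rng.normal(size=J.shape[0])

A = J.T @ J + 1e-3 * np.eye(n_p + n_c)      # damped Hessian, points ordered first
b = J.T @ residual
A_pp, A_pc, A_cc = A[:n_p, :n_p], A[:n_p, n_p:], A[n_p:, n_p:]
b_p, b_c = b[:n_p], b[n_p:]

# A_pp is block diagonal (one 3x3 block per point), so it is inverted block by block.
A_pp_inv = np.zeros_like(A_pp)
for i in range(num_points):
    blk = slice(3 * i, 3 * i + 3)
    A_pp_inv[blk, blk] = np.linalg.inv(A_pp[blk, blk])

A_cc_reduced = A_cc - A_pc.T @ A_pp_inv @ A_pc              # Equation (7.56)
delta_c = np.linalg.solve(A_cc_reduced, b_c - A_pc.T @ A_pp_inv @ b_p)
delta_p = A_pp_inv @ (b_p - A_pc @ delta_c)                 # back-substitution for the points

delta_full = np.linalg.solve(A, b)                          # reference: solve the full system
```

Cameras 0 and 2 share no points in this toy setup, so the corresponding off-diagonal block of `A_cc_reduced` stays zero, while the blocks linking camera 1 to the others fill in.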

Whenever there are global parameters present in the reconstruction algorithm, such as 
camera intrinsics that are common to all of the cameras, or camera rig calibration parameters 
such as those shown in Figure 7.8, they should be ordered last (placed along the right and 
bottom edges of A) in order to reduce fill-in. 

Engels, Stewenius, and Nister (2006) provide a nice recipe for sparse bundle adjustment, 
including all the steps needed to initialize the iterations, as well as typical computation times 
for a system that uses a fixed number of backward-looking frames in a real-time setting. They 
also recommend using homogeneous coordinates for the structure parameters $p_i$, which is a 
good idea, since it avoids numerical instabilities for points near infinity. 

Bundle adjustment is now the standard method of choice for most structure-from-motion 
problems and is commonly applied to problems with hundreds of weakly calibrated images 
and tens of thousands of points, e.g., in systems such as Photosynth. (Much larger prob- 
lems are commonly solved in photogrammetry and aerial imagery, but these are usually care- 
fully calibrated and make use of surveyed ground control points.) However, as the problems 
become larger, it becomes impractical to re-solve full bundle adjustment problems at each 
iteration. 

One approach to dealing with this problem is to use an incremental algorithm, where new 
cameras are added over time. (This makes particular sense if the data is being acquired from 
a video camera or moving vehicle (Nister, Naroditsky, and Bergen 2006; Pollefeys, Nister, 
Frahm et al. 2008).) A Kalman filter can be used to incrementally update estimates as new 
information is acquired. Unfortunately, such sequential updating is only statistically optimal 
for linear least squares problems. 

14 This ordering is preferable when there are fewer cameras than 3D points, which is the usual case. The exception 
is when we are tracking a small number of points through many video frames, in which case this ordering should be 
reversed. 
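
For a linear measurement model, the sequential Kalman filter update reproduces the batch least squares answer, which is the sense in which it is optimal. The sketch below assumes NumPy, a static (non-moving) state, scalar measurements with known noise variance, and a weak Gaussian prior; all of the numerical values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
x_true = np.array([1.0, -2.0, 0.5])
H = rng.normal(size=(40, 3))                   # one scalar linear measurement per row
sigma2 = 0.05 ** 2                             # measurement noise variance
z = H @ x_true + 0.05 * rng.normal(size=40)

# Batch solution with a weak prior N(0, prior_var * I).
prior_var = 1e3
info = np.eye(3) / prior_var + H.T @ H / sigma2
x_batch = np.linalg.solve(info, H.T @ z / sigma2)

# Sequential Kalman filter updates for a static state, one measurement at a time.
x, P = np.zeros(3), prior_var * np.eye(3)
for h, zk in zip(H, z):
    S = h @ P @ h + sigma2                     # innovation variance
    K = P @ h / S                              # Kalman gain
    x = x + K * (zk - h @ x)                   # state update
    P = P - np.outer(K, h @ P)                 # covariance update
```

With a non-linear measurement function, each update would instead linearize around the current estimate, and the result would then depend on the order in which measurements arrive.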

For non-linear problems such as structure from motion, an extended Kalman filter, which 
linearizes measurement and update equations around the current estimate, needs to be used 
(Gelb 1974; Vieville and Faugeras 1990). To overcome this limitation, several passes can 
be made through the data (Azarbayejani and Pentland 1995). Because points disappear from 
view (and old cameras become irrelevant), a variable state dimension filter (VSDF) can be 
used to adjust the set of state variables over time, for example, by keeping only cameras and 
point tracks seen in the last k frames (McLauchlan 2000). A more flexible approach to using 
a fixed number of frames is to propagate corrections backwards through points and cameras 
until the changes on parameters are below a threshold (Steedly and Essa 2001). Variants of 
these techniques, including methods that use a fixed window for bundle adjustment (Engels, 
Stewenius, and Nister 2006) or select keyframes for doing full bundle adjustment (Klein and 
Murray 2008) are now commonly used in real-time tracking and augmented-reality applica- 
tions, as discussed in Section 7.4.2. 

When maximum accuracy is required, it is still preferable to perform a full bundle ad- 
justment over all the frames. In order to control the resulting computational complexity, one 
approach is to lock together subsets of frames into locally rigid configurations and to optimize 
the relative positions of these clusters (Steedly, Essa, and Dellaert 2003). A different approach 
is to select a smaller number of frames to form a skeletal set that still spans the whole dataset 
and produces reconstructions of comparable accuracy (Snavely, Seitz, and Szeliski 2008b). 
We describe this latter technique in more detail in Section 7.4.4, where we discuss applica- 
tions of structure from motion to large image sets. 

While bundle adjustment and other robust non-linear least squares techniques are the 
methods of choice for most structure-from-motion problems, they suffer from initialization 
problems, i.e., they can get stuck in local energy minima if not started sufficiently close 
to the global optimum. Many systems try to mitigate this by being conservative in what 
reconstruction they perform early on and which cameras and points they add to the solution 
(Section 7.4.4). An alternative, however, is to re-formulate the problem using a norm that 
supports the computation of global optima. 

Kahl and Hartley (2008) describe techniques for using $L_\infty$ norms in geometric reconstruction 
problems. The advantage of such norms is that globally optimal solutions can be 
efficiently computed using second-order cone programming (SOCP). The disadvantage is that 
$L_\infty$ norms are particularly sensitive to outliers and so must be combined with good outlier 
rejection techniques before they can be used. 

7.4.2 Application: Match move and augmented reality 

One of the neatest applications of structure from motion is to estimate the 3D motion of a 
video or film camera, along with the geometry of a 3D scene, in order to superimpose 3D 
graphics or computer-generated images (CGI) on the scene. In the visual effects industry, 
this is known as the match move problem (Roble 1999), since the motion of the synthetic 3D 
camera used to render the graphics must be matched to that of the real-world camera. For 
very small motions, or motions involving pure camera rotations, one or two tracked points can 
suffice to compute the necessary visual motion. For planar surfaces moving in 3D, four points 
are needed to compute the homography, which can then be used to insert planar overlays, e.g., 
to replace the contents of advertising billboards during sporting events. 
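
The four-point homography computation mentioned above can be sketched with the direct linear transform (DLT). The example assumes NumPy; the function name and the sample billboard corners are hypothetical, and no coordinate normalization or outlier handling is included.

```python
import numpy as np

def homography_from_points(src, dst):
    """Estimate H (3x3, up to scale) with dst ~ H src from four or more point matches."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(np.array(rows, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]            # assumes H[2, 2] is non-zero (true for this example)

# Corners of a unit-square overlay and their tracked image positions (made-up values).
src = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
dst = [(10.0, 20.0), (110.0, 30.0), (120.0, 140.0), (5.0, 125.0)]
H = homography_from_points(src, dst)
```

Warping the overlay image through `H` for every frame, using the tracked corner positions in that frame, is the basic operation behind planar billboard replacement.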

The general version of this problem requires the estimation of the full 3D camera pose 
along with the focal length (zoom) of the lens and potentially its radial distortion parameters 
(Roble 1999). When the 3D structure of the scene is known ahead of time, pose estima- 
tion techniques such as view correlation (Bogart 1991) or through-the-lens camera control 
(Gleicher and Witkin 1992) can be used, as described in Section 6.2.3. 

For more complex scenes, it is usually preferable to recover the 3D structure simultane- 
ously with the camera motion using structure-from-motion techniques. The trick with using 
such techniques is that in order to prevent any visible jitter between the synthetic graph- 
ics and the actual scene, features must be tracked to very high accuracy and ample feature 
tracks must be available in the vicinity of the insertion location. Some of today’s best known 
match move software packages, such as the boujou package from 2d3, 15 which won an Emmy 
award in 2002, originated in structure-from-motion research in the computer vision commu- 
nity (Fitzgibbon and Zisserman 1998). 

Closely related to the match move problem is robotics navigation, where a robot must es- 
timate its location relative to its environment, while simultaneously avoiding any dangerous 
obstacles. This problem is often known as simultaneous localization and mapping (SLAM) 
(Thrun, Burgard, and Fox 2005) or visual odometry (Levin and Szeliski 2004; Nister, Nar- 
oditsky, and Bergen 2006; Maimone, Cheng, and Matthies 2007). Early versions of such 
algorithms used range-sensing techniques, such as ultrasound, laser range finders, or stereo 
matching, to estimate local 3D geometry, which could then be fused into a 3D model. Newer 
techniques can perform the same task based purely on visual feature tracking, sometimes not 
even requiring a stereo camera rig (Davison, Reid, Molton et al. 2007). 

Another closely related application is augmented reality, where 3D objects are inserted 
into a video feed in real time, often to annotate or help users understand a scene (Azuma, 
Baillot, Behringer et al. 2001). While traditional systems require prior knowledge about the 
scene or object being visually tracked (Rosten and Drummond 2005), newer systems can 
simultaneously build up a model of the 3D environment and then track it, so that graphics can 
be superimposed. 

15 http://www.2d3.com/. 

Figure 7.10 3D augmented reality: (a) Darth Vader and a horde of Ewoks battle it out 
on a table-top recovered using real-time, keyframe-based structure from motion (Klein and 
Murray 2007) © 2007 IEEE; (b) a virtual teapot is fixed to the top of a real-world coffee cup, 
whose pose is re-recognized at each time frame (Gordon and Lowe 2006) © 2007 Springer. 

Klein and Murray (2007) describe a parallel tracking and mapping (PTAM) system, 
which simultaneously applies full bundle adjustment to keyframes selected from a video 
stream, while performing robust real-time pose estimation on intermediate frames. Fig- 
ure 7.10a shows an example of their system in use. Once an initial 3D scene has been 
reconstructed, a dominant plane is estimated (in this case, the table-top) and 3D animated 
characters are virtually inserted. Klein and Murray (2008) extend their previous system to 
handle even faster camera motion by adding edge features, which can still be detected even 
when interest points become too blurred. They also use a direct (intensity-based) rotation 
estimation algorithm for even faster motions. 

Instead of modeling the whole scene as one rigid reference frame, Gordon and Lowe 
(2006) first build a 3D model of an individual object using feature matching and structure 
from motion. Once the system has been initialized, for every new frame, they find the object 
and its pose using a 3D instance recognition algorithm, and then superimpose a graphical 
object onto that model, as shown in Figure 7.10b. 

While reliably tracking such objects and environments is now a well-solved problem, 
determining which pixels should be occluded by foreground scene elements still remains an 
open problem (Chuang, Agarwala, Curless et al. 2002; Wang and Cohen 2007a). 

7.4.3 Uncertainty and ambiguities 

Because structure from motion involves the estimation of so many highly coupled parameters, 
often with no known “ground truth” components, the estimates produced by structure from 
motion algorithms can often exhibit large amounts of uncertainty (Szeliski and Kang 1997). 
An example of this is the classic bas-relief ambiguity, which makes it hard to simultaneously 
estimate the 3D depth of a scene and the amount of camera motion (Oliensis 2005). 16 

As mentioned before, a unique coordinate frame and scale for a reconstructed scene can- 
not be recovered from monocular visual measurements alone. (When a stereo rig is used, 
the scale can be recovered if we know the distance (baseline) between the cameras.) This 
seven-degree-of-freedom gauge ambiguity makes it tricky to compute the covariance matrix 
associated with a 3D reconstruction (Triggs, McLauchlan, Hartley et al. 1999; Kanatani and 
Morris 2001). A simple way to compute a covariance matrix that ignores the gauge freedom 
(indeterminacy) is to throw away the seven smallest eigenvalues of the information matrix (in- 
verse covariance), whose values are equivalent to the problem Hessian A up to noise scaling 
(see Section 6.1.4 and Appendix B.6). After we do this, the resulting matrix can be inverted 
to obtain an estimate of the parameter covariance. 
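
The eigenvalue truncation described above amounts to a pseudo-inverse of the information matrix. The sketch below assumes NumPy and builds a toy information matrix with a known seven-dimensional null space standing in for the gauge freedoms; the function name and problem sizes are illustrative.

```python
import numpy as np

def gauge_free_covariance(info, num_gauge=7):
    """Pseudo-inverse of `info` that discards its `num_gauge` smallest eigenvalues."""
    w, V = np.linalg.eigh(info)                    # eigenvalues in ascending order
    w_keep, V_keep = w[num_gauge:], V[:, num_gauge:]
    return (V_keep / w_keep) @ V_keep.T            # V diag(1 / w) V^T on the kept subspace

# Toy information matrix whose null space (spanned by N) plays the role of the gauge.
rng = np.random.default_rng(7)
N, _ = np.linalg.qr(rng.normal(size=(20, 7)))      # orthonormal basis, 20 x 7
J = rng.normal(size=(40, 20)) @ (np.eye(20) - N @ N.T)
info = J.T @ J
cov = gauge_free_covariance(info)
```

The resulting covariance assigns zero variance to the gauge directions, so uncertainty is only reported for changes that actually affect the measurements.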

Szeliski and Kang (1997) use this approach to visualize the largest directions of variation 
in typical structure from motion problems. Not surprisingly, they find that (ignoring the gauge 
freedoms), the greatest uncertainties for problems such as observing an object from a small 
number of nearby viewpoints are in the depths of the 3D structure relative to the extent of the 
camera motion. 17 

It is also possible to estimate local or marginal uncertainties for individual parameters, 
which corresponds simply to taking block sub-matrices from the full covariance matrix. Un- 
der certain conditions, such as when the camera poses are relatively certain compared to 3D 
point locations, such uncertainty estimates can be meaningful. However, in many cases, indi- 
vidual uncertainty measures can mask the extent to which reconstruction errors are correlated, 
which is why looking at the first few modes of greatest joint variation can be helpful. 

The other way in which gauge ambiguities affect structure from motion and, in particular, 
bundle adjustment is that they make the system Hessian matrix A rank-deficient and hence 
impossible to invert. A number of techniques have been proposed to mitigate this problem 
(Triggs, McLauchlan, Hartley et al. 1999; Bartoli 2003). In practice, however, it appears that
simply adding a small amount of the Hessian diagonal λ diag(A) to the Hessian A itself, as is
done in the Levenberg-Marquardt non-linear least squares algorithm (Appendix A.3), usually
works well.

16 Bas-relief refers to a kind of sculpture in which objects, often on ornamental friezes, are sculpted with less
depth than they actually occupy. When lit from above by sunlight, they appear to have true 3D depth because of the
ambiguity between relative depth and the angle of the illuminant (Section 12.1.1).

17 A good way to minimize the amount of such ambiguities is to use wide field of view cameras (Antone and
Teller 2002; Levin and Szeliski 2006).
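
The diagonal damping described above can be sketched in a few lines of NumPy. The helper name `damped_solve` and the default value of `lam` are illustrative assumptions; a real Levenberg-Marquardt loop would also adapt `lam` between iterations and would use a sparse solver.

```python
import numpy as np

def damped_solve(A, b, lam=1e-3):
    """Solve (A + lam * diag(A)) delta = b for the parameter update delta."""
    # Scaling the diagonal makes a rank-deficient A invertible as long as
    # every diagonal entry of A is strictly positive.
    damped = A + lam * np.diag(np.diag(A))
    return np.linalg.solve(damped, b)
```

Because the damping is proportional to the existing diagonal, a parameter whose diagonal entry is exactly zero would still leave the system singular; this sketch does not guard against that case.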


7.4.4 Application: Reconstruction from Internet photos

The most widely used application of structure from motion is in the reconstruction of 3D 
objects and scenes from video sequences and collections of images (Pollefeys and Van Gool 
2002). The last decade has seen an explosion of techniques for performing this task auto- 
matically without the need for any manual correspondence or pre-surveyed ground control 
points. A lot of these techniques assume that the scene is taken with the same camera and 
hence the images all have the same intrinsics (Fitzgibbon and Zisserman 1998; Koch, Polle- 
feys, and Van Gool 2000; Schaffalitzky and Zisserman 2002; Tuytelaars and Van Gool 2004; 
Pollefeys, Nister, Frahm et al. 2008; Moons, Van Gool, and Vergauwen 2010). Many of 
these techniques take the results of the sparse feature matching and structure from motion 
computation and then compute dense 3D surface models using multi-view stereo techniques 
(Section 11.6) (Koch, Pollefeys, and Van Gool 2000; Pollefeys and Van Gool 2002; Pollefeys,
Nister, Frahm et al. 2008; Moons, Van Gool, and Vergauwen 2010). 

The latest innovation in this space has been the application of structure from motion and 
multi-view stereo techniques to thousands of images taken from the Internet, where very little 
is known about the cameras taking the photographs (Snavely, Seitz, and Szeliski 2008a). Be- 
fore the structure from motion computation can begin, it is first necessary to establish sparse 
correspondences between different pairs of images and to then link such correspondences into 
feature tracks, which associate individual 2D image features with global 3D points. Because
the O(N^2) comparison of all pairs of images can be very slow, a number of techniques have
been developed in the recognition community to make this process faster (Section 14.3.2) 
(Nister and Stewenius 2006; Philbin, Chum, Sivic et al. 2008; Li, Wu, Zach et al. 2008; 
Chum, Philbin, and Zisserman 2008; Chum and Matas 2010). 
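
One common way to link pairwise correspondences into feature tracks is a union-find (disjoint set) pass over the matches, sketched below in plain Python. The function name `build_tracks`, the `(image, feature)` tuple encoding, and the rule of discarding tracks that visit the same image twice are assumptions made for illustration, not details taken from the systems cited above.

```python
from collections import defaultdict

def build_tracks(pairwise_matches):
    """Link pairwise matches into feature tracks.

    pairwise_matches: iterable of ((image_a, feature_a), (image_b, feature_b)).
    Returns a sorted list of tracks, each a sorted list of (image, feature).
    """
    parent = {}

    def find(obs):
        parent.setdefault(obs, obs)
        while parent[obs] != obs:
            parent[obs] = parent[parent[obs]]  # path halving
            obs = parent[obs]
        return obs

    # Merge the two observations of every pairwise match.
    for a, b in pairwise_matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(list)
    for obs in list(parent):
        groups[find(obs)].append(obs)

    tracks = []
    for members in groups.values():
        images = [image for image, _ in members]
        # Assumption: a track containing two features from one image is inconsistent.
        if len(images) == len(set(images)):
            tracks.append(sorted(members))
    return sorted(tracks)
```

Each surviving track corresponds to one candidate 3D point observed in several images.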

To begin the reconstruction process, it is important to select a good pair of images,
where there are both a large number of consistent matches (to lower the likelihood of in- 
correct correspondences) and a significant amount of out-of-plane parallax, 18 to ensure that 
a stable reconstruction can be obtained (Snavely, Seitz, and Szeliski 2006). The EXIF tags 
associated with the photographs can be used to get good initial estimates for camera focal 
lengths, although this is not always strictly necessary, since these parameters are re-adjusted 
as part of the bundle adjustment process. 
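
A possible way to quantify out-of-plane parallax for a candidate pair is to fit a homography robustly and check how many matches it explains: a pair with many matches but a low homography inlier fraction is a plausible seed. The sketch below uses an unnormalized direct linear transform inside a basic RANSAC loop. The function names, the 2-pixel threshold, and the iteration count are assumptions, and point normalization is omitted for brevity.

```python
import numpy as np

def fit_homography(src, dst):
    """Direct linear transform from four or more (x, y) -> (x', y') pairs."""
    rows = []
    for (x, y), (xp, yp) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, xp * x, xp * y, xp])
        rows.append([0, 0, 0, -x, -y, -1, yp * x, yp * y, yp])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    # Right singular vector associated with the smallest singular value.
    return vt[-1].reshape(3, 3)

def homography_inlier_fraction(src, dst, thresh=2.0, iters=200, seed=0):
    """RANSAC estimate of the fraction of matches explained by one homography."""
    rng = np.random.default_rng(seed)
    n = len(src)
    src_h = np.hstack([src, np.ones((n, 1))])
    best = 0
    for _ in range(iters):
        sample = rng.choice(n, size=4, replace=False)
        H = fit_homography(src[sample], dst[sample])
        proj = src_h @ H.T
        with np.errstate(divide="ignore", invalid="ignore"):
            proj = proj[:, :2] / proj[:, 2:3]
        err = np.linalg.norm(proj - dst, axis=1)
        best = max(best, int(np.sum(err < thresh)))
    return best / n
```

A fraction close to 1 suggests a planar scene or a nearly pure rotation, either of which makes a poor starting pair.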

Once an initial pair has been reconstructed, the pose of cameras that see a sufficient num- 
ber of the resulting 3D points can be estimated (Section 6.2) and the complete set of cameras 
and feature correspondences can be used to perform another round of bundle adjustment.
Figure 7.11 shows the progression of the incremental bundle adjustment algorithm, where sets of
cameras are added after each successive round of bundle adjustment, while Figure 7.12 shows
some additional results. An alternative to this kind of seed and grow approach is to first
reconstruct triplets of images and then hierarchically merge triplets into larger collections, as
described by Fitzgibbon and Zisserman (1998).

18 A simple way to compute this is to robustly fit a homography to the correspondences and measure reprojection
errors.

Figure 7.11 Incremental structure from motion (Snavely, Seitz, and Szeliski 2006) © 2006
ACM: Starting with an initial two-frame reconstruction of Trevi Fountain, batches of images
are added using pose estimation, and their positions (along with the 3D model) are refined
using bundle adjustment.

Unfortunately, as the incremental structure from motion algorithm continues to add more 
cameras and points, it can become extremely slow. The direct solution of a dense system 
of O(N) equations for the camera pose updates can take O(N^3) time; while structure from
motion problems are rarely dense, scenes such as city squares have a high percentage of 
cameras that see points in common. Re-running the bundle adjustment algorithm after every 
few camera additions results in a quartic scaling of the run time with the number of images 
in the dataset. One approach to solving this problem is to select a smaller number of images 
for the original scene reconstruction and to fold in the remaining images at the very end. 
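
To see where the quartic scaling comes from, suppose (as a simplifying assumption) that bundle adjustment is re-run after every batch of $c$ new cameras and that a run over $n$ cameras costs on the order of $n^3$ operations. Summing over the $N/c$ runs gives

$$\sum_{k=1}^{N/c} (kc)^3 = c^3 \left[ \frac{(N/c)(N/c + 1)}{2} \right]^2 \approx \frac{N^4}{4c},$$

so the total cost grows as $O(N^4)$ for a fixed batch size $c$.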

Snavely, Seitz, and Szeliski (2008b) develop an algorithm for computing such a skele- 
tal set of images, which is guaranteed to produce a reconstruction whose error is within a 
bounded factor of the optimal reconstruction accuracy. Their algorithm first evaluates all 
pairwise uncertainties (position covariances) between overlapping images and then chains 
them together to estimate a lower bound for the relative uncertainty of any distant pair. The 
skeletal set is constructed so that the maximal uncertainty between any pair grows by no 
more than a constant factor. Figure 7.13 shows an example of the skeletal set computed for 
784 images of the Pantheon in Rome. As you can see, even though the skeletal set contains 
just a fraction of the original images, the shapes of the skeletal set and full bundle adjusted 
reconstructions are virtually indistinguishable. 

The ability to automatically reconstruct 3D models from large, unstructured image collections
has opened a wide variety of additional applications, including the ability to automatically
find and label locations and regions of interest (Simon, Snavely, and Seitz 2007; Simon
and Seitz 2008; Gammeter, Bossard, Quack et al. 2009) and to cluster large image collections
so that they can be automatically labeled (Li, Wu, Zach et al. 2008; Quack, Leibe, and Van
Gool 2008). Some of these applications are discussed in more detail in Section 13.1.2.

Figure 7.12 3D reconstructions produced by the incremental structure from motion algo-
rithm developed by Snavely, Seitz, and Szeliski (2006) © 2006 ACM: (a) cameras and point
cloud from Trafalgar Square; (b) cameras and points overlaid on an image from the Great Wall
of China; (c) overhead view of a reconstruction of the Old Town Square in Prague registered
to an aerial photograph.

Figure 7.13 Large scale structure from motion using skeletal sets (Snavely, Seitz, and
Szeliski 2008b) © 2008 IEEE: (a) original match graph for 784 images; (b) skeletal set
containing 101 images; (c) top-down view of scene (Pantheon) reconstructed from the skele-
tal set; (d) reconstruction after adding in the remaining images using pose estimation; (e) final
bundle adjusted reconstruction, which is almost identical.


7.5 Constrained structure and motion 

The most general algorithms for structure from motion make no prior assumptions about the 
objects or scenes that they are reconstructing. In many cases, however, the scene contains 
higher-level geometric primitives, such as lines and planes. These can provide information 
complementary to interest points and also serve as useful building blocks for 3D modeling 
and visualization. Furthermore, these primitives are often arranged in particular relationships, 
i.e., many lines and planes are either parallel or orthogonal to each other. This is particularly 
true of architectural scenes and models, which we study in more detail in Section 12.6.1. 

Sometimes, instead of exploiting regularity in the scene structure, it is possible to take 
advantage of a constrained motion model. For example, if the object of interest is rotating 
on a turntable (Szeliski 1991b), i.e., around a fixed but unknown axis, specialized techniques 
can be used to recover this motion (Fitzgibbon, Cross, and Zisserman 1998). In other situa- 
tions, the camera itself may be moving in a fixed arc around some center of rotation (Shum 
and He 1999). Specialized capture setups, such as mobile stereo camera rigs or moving ve- 
hicles equipped with multiple fixed cameras, can also take advantage of the knowledge that 
individual cameras are (mostly) fixed with respect to the capture rig, as shown in Figure 7.8. 19 


7.5.1 Line-based techniques 

It is well known that pairwise epipolar geometry cannot be recovered from line matches 
alone, even if the cameras are calibrated. To see this, think of projecting the set of lines in 
each image into a set of 3D planes in space. You can move the two cameras around into any 
configuration you like and still obtain a valid reconstruction for 3D lines. 

When lines are visible in three or more views, the trifocal tensor can be used to transfer 
lines from one pair of images to another (Hartley and Zisserman 2004). The trifocal tensor 
can also be computed on the basis of line matches alone. 

Schmid and Zisserman (1997) describe a widely used technique for matching 2D lines
based on the average of 15 × 15 pixel correlation scores evaluated at all pixels along their
common line segment intersection. 20 In their system, the epipolar geometry is assumed to be
known, e.g., computed from point matches. For wide baselines, all possible homographies
corresponding to planes passing through the 3D line are used to warp pixels and the maximum
correlation score is used. For triplets of images, the trifocal tensor is used to verify that
the lines are in geometric correspondence before evaluating the correlations between line
segments. Figure 7.14 shows the results of using their system.

19 Because of mechanical compliance and jitter, it may be prudent to allow for a small amount of individual camera
rotation around a nominal position.

Figure 7.14 Two images of a toy house along with their matched 3D line segments (Schmid
and Zisserman 1997) © 1997 Springer.

Bartoli and Sturm (2003) describe a complete system for extending three view relations 
(trifocal tensors) computed from manual line correspondences to a full bundle adjustment of 
all the line and camera parameters. The key to their approach is to use the Plücker
coordinates (2.12) to parameterize lines and to directly minimize reprojection errors. It is also
possible to represent 3D line segments by their endpoints and to measure either the reprojec- 
tion error perpendicular to the detected 2D line segments in each image or the 2D errors using 
an elongated uncertainty ellipse aligned with the line segment direction (Szeliski and Kang 
1994). 
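
The endpoint-based alternative can be sketched as follows: project the two 3D endpoints and sum their squared perpendicular distances to the detected 2D line. The helper names and the `(a, b, c)` line encoding (with a x + b y + c = 0) are assumptions for illustration; the elongated-ellipse variant mentioned above is not shown.

```python
import numpy as np

def project(P, X):
    """Project a 3D point X with a 3x4 camera matrix P to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def endpoint_line_error(P, X0, X1, line):
    """Sum of squared perpendicular distances from the projected endpoints
    of the 3D segment (X0, X1) to the 2D line a*x + b*y + c = 0."""
    a, b, c = line
    norm = np.hypot(a, b)
    dists = [(a * x + b * y + c) / norm for x, y in (project(P, X0), project(P, X1))]
    return float(dists[0] ** 2 + dists[1] ** 2)
```

Only the component perpendicular to the detected line is penalized, which reflects the fact that the endpoints of a detected segment are usually poorly localized along the line.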

Instead of reconstructing 3D lines, Bay, Ferrari, and Van Gool (2005) use RANSAC to
group lines into likely coplanar subsets. Four lines are chosen at random to compute a homog- 
raphy, which is then verified for these and other plausible line segment matches by evaluating 
color histogram-based correlation scores. The 2D intersection points of lines belonging to the 
same plane are then used as virtual measurements to estimate the epipolar geometry, which 
is more accurate than using the homographies directly. 

An alternative to grouping lines into coplanar subsets is to group lines by parallelism. 
Whenever three or more 2D lines share a common vanishing point, there is a good likelihood 
that they are parallel in 3D. By finding multiple vanishing points in an image (Section 4.3.3) 
and establishing correspondences between such vanishing points in different images, the rel- 
ative rotations between the various images (and often the camera intrinsics) can be directly 
estimated (Section 6.3.2). 
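
As a minimal sketch of the rotation estimate, assume the intrinsic matrix K is already known and that matched vanishing points have been given consistent signs. Each vanishing point then back-projects to a unit direction, and the relative rotation follows from an orthogonal Procrustes (SVD) fit. The function names are hypothetical, and estimating the intrinsics themselves is not covered here.

```python
import numpy as np

def vanishing_direction(K, vp):
    """Unit 3D direction (camera frame) for a vanishing point vp = (x, y)."""
    d = np.linalg.solve(K, np.array([vp[0], vp[1], 1.0]))
    return d / np.linalg.norm(d)

def relative_rotation(dirs0, dirs1):
    """Rotation R minimizing sum ||dirs1[i] - R @ dirs0[i]||^2.

    dirs0, dirs1: (n, 3) arrays of matched unit directions, n >= 2.
    """
    M = dirs1.T @ dirs0  # sum of outer products dirs1[i] dirs0[i]^T
    U, _, Vt = np.linalg.svd(M)
    S = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])  # guard against reflections
    return U @ S @ Vt
```

Because a vanishing point only fixes a direction up to sign, a practical system would need to resolve that ambiguity (for example, by trying both signs) before calling `relative_rotation`.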

20 Because lines often occur at depth or orientation discontinuities, it may be preferable to compute correlation
scores (or to match color histograms (Bay, Ferrari, and Van Gool 2005)) separately on each side of the line.

Shum, Han, and Szeliski (1998) describe a 3D modeling system which first constructs
calibrated panoramas from multiple images (Section 7.4) and then has the user draw vertical 
and horizontal lines in the image to demarcate the boundaries of planar regions. The lines 
are initially used to establish an absolute rotation for each panorama and are later used (along 
with the inferred vertices and planes) to infer a 3D structure, which can be recovered up to 
scale from one or more images (Figure 12.15). 

A fully automated approach to line-based structure from motion is presented by Werner
and Zisserman (2002). In their system, they first find lines and group them by common van- 
ishing points in each image (Section 4.3.3). The vanishing points are then used to calibrate the 
camera, i.e., to perform a “metric upgrade” (Section 6.3.2). Lines corresponding to common
vanishing points are then matched using both appearance (Schmid and Zisserman 1997) and 
trifocal tensors. The resulting set of 3D lines, color coded by common vanishing directions 
(3D orientations) is shown in Figure 12.16a. These lines are then used to infer planes and a 
block-structured model for the scene, as described in more detail in Section 12.6.1. 


7.5.2 Plane-based techniques 

In scenes that are rich in planar structures, e.g., in architecture and certain kinds of manu- 
factured objects such as furniture, it is possible to directly estimate homographies between 
different planes, using either feature-based or intensity-based methods. In principle, this in- 
formation can be used to simultaneously infer the camera poses and the plane equations, i.e., 
to compute plane-based structure from motion. 

Luong and Faugeras (1996) show how a fundamental matrix can be directly computed 
from two or more homographies using algebraic manipulations and least squares. Unfortu- 
nately, this approach often performs poorly, since the algebraic errors do not correspond to 
meaningful reprojection errors (Szeliski and Torr 1998). 

A better approach is to hallucinate virtual point correspondences within the areas from 
which each homography was computed and to feed them into a standard structure from mo- 
tion algorithm (Szeliski and Torr 1998). An even better approach is to use full bundle adjust- 
ment with explicit plane equations, as well as additional constraints to force reconstructed 
co-planar features to lie exactly on their corresponding planes. (A principled way to do this 
is to establish a coordinate frame for each plane, e.g., at one of the feature points, and to use 
2D in-plane parameterizations for the other points.) The system developed by Shum, Han, 
and Szeliski (1998) shows an example of such an approach, where the directions of lines and 
normals for planes in the scene are pre-specified by the user. 
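
The idea of hallucinating virtual point correspondences can be sketched by sampling a regular grid inside the region that supported a homography and mapping it through that homography. The function name, the rectangular region, and the grid size `n` are assumptions for illustration; the resulting pairs would then be passed to an ordinary point-based structure from motion routine.

```python
import numpy as np

def virtual_correspondences(H, x_min, y_min, x_max, y_max, n=4):
    """Sample an n x n grid in the source region and map it through H.

    Returns (src, dst), two (n*n, 2) arrays of matching pixel coordinates.
    """
    xs = np.linspace(x_min, x_max, n)
    ys = np.linspace(y_min, y_max, n)
    gx, gy = np.meshgrid(xs, ys)
    src = np.stack([gx.ravel(), gy.ravel()], axis=1)
    src_h = np.hstack([src, np.ones((len(src), 1))])
    dst_h = src_h @ H.T
    dst = dst_h[:, :2] / dst_h[:, 2:3]  # perspective division
    return src, dst
```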

7.6 Additional reading

The topic of structure from motion is extensively covered in books and review articles on 
multi-view geometry (Faugeras and Luong 2001; Hartley and Zisserman 2004; Moons, Van 
Gool, and Vergauwen 2010). For two-frame reconstruction, Hartley (1997a) wrote a highly
cited paper on the “eight-point algorithm” for computing an essential or fundamental ma- 
trix with reasonable point normalization. When the cameras are calibrated, the five-point 
algorithm of Nister (2004) can be used in conjunction with RANSAC to obtain initial recon- 
structions from the minimum number of points. When the cameras are uncalibrated, various 
self-calibration techniques can be found in work by Hartley and Zisserman (2004); Moons, 
Van Gool, and Vergauwen (2010) — I only briefly mention one of the simplest techniques, the 
Kruppa equations (7.35). 

In applications where points are being tracked from frame to frame, factorization tech- 
niques, based on either orthographic camera models (Tomasi and Kanade 1992; Poelman 
and Kanade 1997; Costeira and Kanade 1995; Morita and Kanade 1997; Morris and Kanade 
1998; Anandan and Irani 2002) or projective extensions (Christy and Horaud 1996; Sturm 
and Triggs 1996; Triggs 1996; Oliensis and Hartley 2007), can be used. 

Triggs, McLauchlan, Hartley et al. (1999) provide a good tutorial and survey on bundle 
adjustment, while Lourakis and Argyros (2009) and Engels, Stewenius, and Nister (2006) 
provide tips on implementation and effective practices. Bundle adjustment is also covered 
in textbooks and surveys on multi-view geometry (Faugeras and Luong 2001; Hartley and 
Zisserman 2004; Moons, Van Gool, and Vergauwen 2010). Techniques for handling larger 
problems are described by Snavely, Seitz, and Szeliski (2008b); Agarwal, Snavely, Simon 
et al. (2009); Jeong, Nister, Steedly et al. (2010); Agarwal, Snavely, Seitz et al. (2010). 
While bundle adjustment is often called as an inner loop inside incremental reconstruction 
algorithms (Snavely, Seitz, and Szeliski 2006), hierarchical (Fitzgibbon and Zisserman 1998; 
Farenzena, Fusiello, and Gherardi 2009) and global (Rother and Carlsson 2002; Martinec and 
Pajdla 2007) approaches for initialization are also possible and perhaps even preferable. 

As structure from motion starts being applied to dynamic scenes, the topic of non-rigid 
structure from motion (Torresani, Hertzmann, and Bregler 2008), which we do not cover in 
this book, will become more important. 


7.7 Exercises 

Ex 7.1: Triangulation Use the calibration pattern you built and tested in Exercise 6.7 to 
test your triangulation accuracy. As an alternative, generate synthetic 3D points and cameras 
and add noise to the 2D point measurements.

1. Assume that you know the camera pose, i.e., the camera matrices. Use the 3D distance
to rays (7.4) or linearized versions of Equations (7.5-7.6) to compute an initial set of
3D locations. Compare these to your known ground truth locations.

2. Use iterative non-linear minimization to improve your initial estimates and report on 
the improvement in accuracy. 

3. (Optional) Use the technique described by Hartley and Sturm (1997) to perform two- 
frame triangulation. 

4. See if any of the failure modes reported by Hartley and Sturm (1997) or Hartley (1998) 
occur in practice. 

Ex 7.2: Essential and fundamental matrix Implement the two-frame E and F matrix es- 
timation techniques presented in Section 7.2, with suitable re-scaling for better noise immu- 
nity. 

1. Use the data from Exercise 7.1 to validate your algorithms and to report on their accu- 
racy. 

2. (Optional) Implement one of the improved F or E estimation algorithms, e.g., us- 
ing renormalization (Zhang 1998b; Torr and Fitzgibbon 2004; Hartley and Zisserman 
2004), RANSAC (Torr and Murray 1997), least median squares (LMS), or the five-point
algorithm developed by Nister (2004). 

Ex 7.3: View morphing and interpolation Implement automatic view morphing, i.e., com- 
pute two-frame structure from motion and then use these results to generate a smooth anima- 
tion from one image to the next (Section 7.2.3). 

1. Decide how to represent your 3D scene, e.g., compute a Delaunay triangulation of the 
matched points and decide what to do with the triangles near the border. (Hint: try fitting
a plane to the scene, e.g., behind most of the points.) 

2. Compute your in-between camera positions and orientations. 

3. Warp each triangle to its new location, preferably using the correct perspective projec- 
tion (Szeliski and Shum 1997). 

4. (Optional) If you have a denser 3D model (e.g., from stereo), decide what to do at the 
“cracks”. 

5. (Optional) For a non-rigid scene, e.g., two pictures of a face with different expressions, 
not all of your matched points will obey the epipolar geometry. Decide how to handle 
them to achieve the best effect.

Ex 7.4: Factorization Implement the factorization algorithm described in Section 7.3 using
point tracks you computed in Exercise 4.5.

1. (Optional) Implement uncertainty rescaling (Anandan and Irani 2002) and comment on
whether this improves your results.

2. (Optional) Implement one of the perspective improvements to factorization discussed 
in Section 7.3.1 (Christy and Horaud 1996; Sturm and Triggs 1996; Triggs 1996). Does 
this produce significantly lower reprojection errors? Can you upgrade this reconstruc- 
tion to a metric one? 

Ex 7.5: Bundle adjuster Implement a full bundle adjuster. This may sound daunting, but 
it really is not. 

1. Devise the internal data structures and external file representations to hold your camera 
parameters (position, orientation, and focal length), 3D point locations (Euclidean or 
homogeneous), and 2D point tracks (frame and point identifier as well as 2D locations). 

2. Use some other technique, such as factorization, to initialize the 3D point and camera 
locations from your 2D tracks (e.g., a subset of points that appears in all frames). 

3. Implement the code corresponding to the forward transformations in Figure 7.7, i.e., 
for each 2D point measurement, take the corresponding 3D point, map it through the 
camera transformations (including perspective projection and focal length scaling), and 
compare it to the 2D point measurement to get a residual error. 

4. Take the residual error and compute its derivatives with respect to all the unknown 
motion and structure parameters, using backward chaining, as shown, e.g., in Figure 7.7 
and Equation (6.47). This gives you the sparse Jacobian J used in Equations (6.13- 
6.17) and Equation (6.43). 

5. Use a sparse least squares or linear system solver, e.g., MATLAB, SuiteSparse, or
SPARSKIT (see Appendix A.4 and A.5), to solve the corresponding linearized system,
adding a small amount of diagonal preconditioning, as in Levenberg-Marquardt. 

6. Update your parameters, make sure your rotation matrices are still orthonormal (e.g., 
by re-computing them from your quaternions), and continue iterating while monitoring 
your residual error. 

7. (Optional) Use the “Schur complement trick” (7.56) to reduce the size of the system
being solved (Triggs, McLauchlan, Hartley et al. 1999; Hartley and Zisserman 2004;
Lourakis and Argyros 2009; Engels, Stewenius, and Nister 2006).

8. (Optional) Implement your own iterative sparse solver, e.g., conjugate gradient, and
compare its performance to a direct method.

9. (Optional) Make your bundle adjuster robust to outliers, or try adding some of the other 
improvements discussed by Engels, Stewenius, and Nister (2006). Can you think of any
other ways to make your algorithm even faster or more robust? 

Ex 7.6: Match move and augmented reality Use the results of the previous exercise to 
superimpose a rendered 3D model on top of video. See Section 7.4.2 for more details and 
ideas. Check for how “locked down” the objects are. 

Ex 7.7: Line-based reconstruction Augment the previously developed bundle adjuster to 
include lines, possibly with known 3D orientations. 

Optionally, use co-planar sets of points and lines to hypothesize planes and to enforce 
co-planarity (Schaffalitzky and Zisserman 2002; Robertson and Cipolla 2002).

Ex 7.8: Flexible bundle adjuster Design a bundle adjuster that allows for arbitrary chains 
of transformations and prior knowledge about the unknowns, as suggested in Figures 7.7-7.8.

Ex 7.9: Unordered image matching Compute the camera pose and 3D structure of a scene 
from an arbitrary collection of photographs (Brown and Lowe 2003; Snavely, Seitz, and 
Szeliski 2006).

Figure 8.1 Motion estimation: (a-b) regularization-based optical flow (Nagel and Enkelmann
1986) © 1986 IEEE; (c-d) layered motion estimation (Wang and Adelson 1994) ©
1994 IEEE; (e-f) sample image and ground truth flow from evaluation database (Baker,
Black, Lewis et al. 2007) © 2007 IEEE.

Algorithms for aligning images and estimating motion in video sequences are among the most
widely used in computer vision. For example, frame-rate image alignment is widely used in 
camcorders and digital cameras to implement their image stabilization (IS) feature. 

An early example of a widely used image registration algorithm is the patch-based trans- 
lational alignment (optical flow) technique developed by Lucas and Kanade (1981). Variants 
of this algorithm are used in almost all motion-compensated video compression schemes 
such as MPEG and H.263 (Le Gall 1991). Similar parametric motion estimation algorithms 
have found a wide variety of applications, including video summarization (Teodosio and 
Bender 1993; Irani and Anandan 1998), video stabilization (Hansen, Anandan, Dana et al. 
1994; Srinivasan, Chellappa, Veeraraghavan et al. 2005; Matsushita, Ofek, Ge et al. 2006), 
and video compression (Irani, Hsu, and Anandan 1995; Lee, ge Chen, lung Bruce Lin et 
al. 1997). More sophisticated image registration algorithms have also been developed for 
medical imaging and remote sensing. Image registration techniques are surveyed by Brown 
(1992), Zitová and Flusser (2003), Goshtasby (2005), and Szeliski (2006a).

To estimate the motion between two or more images, a suitable error metric must first 
be chosen to compare the images (Section 8.1). Once this has been established, a suitable 
search technique must be devised. The simplest technique is to exhaustively try all possible 
alignments, i.e., to do a full search. In practice, this may be too slow, so hierarchical coarse- 
to-fine techniques (Section 8.1.1) based on image pyramids are normally used. Alternatively, 
Fourier transforms (Section 8.1.2) can be used to speed up the computation. 

To get sub-pixel precision in the alignment, incremental methods (Section 8.1.3) based 
on a Taylor series expansion of the image function are often used. These can also be applied 
to parametric motion models (Section 8.2), which model global image transformations such 
as rotation or shearing. Motion estimation can be made more reliable by learning the typi- 
cal dynamics or motion statistics of the scenes or objects being tracked, e.g., the natural gait 
of walking people (Section 8.2.2). For more complex motions, piecewise parametric spline 
motion models (Section 8.3) can be used. In the presence of multiple independent (and per- 
haps non-rigid) motions, general-purpose optical flow (or optic flow) techniques need to be 
used (Section 8.4). For even more complex motions that include a lot of occlusions, layered 
motion models (Section 8.5), which decompose the scene into coherently moving layers, can 
work well. 

In this chapter, we describe each of these techniques in more detail. Additional details 
can be found in review and comparative evaluation papers on motion estimation (Barron, 
Fleet, and Beauchemin 1994; Mitiche and Bouthemy 1996; Stiller and Konrad 1999; Szeliski 
2006a; Baker, Black, Lewis et al. 2007).


8.1 Translational alignment

The simplest way to establish an alignment between two images or image patches is to shift
one image relative to the other. Given a template image $I_0(\mathbf{x})$ sampled at discrete pixel
locations $\{\mathbf{x}_i = (x_i, y_i)\}$, we wish to find where it is located in image $I_1(\mathbf{x})$. A least squares
solution to this problem is to find the minimum of the sum of squared differences (SSD)
function

$$E_{\mathrm{SSD}}(\mathbf{u}) = \sum_i [I_1(\mathbf{x}_i + \mathbf{u}) - I_0(\mathbf{x}_i)]^2 = \sum_i e_i^2, \tag{8.1}$$

where $\mathbf{u} = (u, v)$ is the displacement and $e_i = I_1(\mathbf{x}_i + \mathbf{u}) - I_0(\mathbf{x}_i)$ is called the residual
error (or the displaced frame difference in the video coding literature). 1 (We ignore for the
moment the possibility that parts of $I_0$ may lie outside the boundaries of $I_1$ or be otherwise
not visible.) The assumption that corresponding pixel values remain the same in the two
images is often called the brightness constancy constraint. 2

In general, the displacement u can be fractional, so a suitable interpolation function must 
be applied to image $I_1(\mathbf{x})$. In practice, a bilinear interpolant is often used but bicubic interpolation
can yield slightly better results (Szeliski and Scharstein 2004). Color images can be
processed by summing differences across all three color channels, although it is also possible 
to first transform the images into a different color space or to only use the luminance (which 
is often done in video encoders). 
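
A minimal sketch of the SSD criterion, restricted to integer displacements and grayscale images so that no interpolation is needed, might look as follows. The function names `ssd` and `full_search` are hypothetical, and the exhaustive loop is only practical for small search ranges.

```python
import numpy as np

def ssd(template, image, u, v):
    """Sum of squared differences between the template and the image patch
    whose top-left corner is at column u, row v."""
    h, w = template.shape
    patch = image[v:v + h, u:u + w]
    diff = patch.astype(float) - template
    return float(np.sum(diff ** 2))

def full_search(template, image):
    """Try every integer displacement that keeps the template inside the image.

    Returns (best_error, u, v).
    """
    h, w = template.shape
    rows, cols = image.shape
    best = None
    for v in range(rows - h + 1):
        for u in range(cols - w + 1):
            err = ssd(template, image, u, v)
            if best is None or err < best[0]:
                best = (err, u, v)
    return best
```

For color images, one option consistent with the text is to call `ssd` once per channel and add the results.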

Robust error metrics. We can make the above error metric more robust to outliers by replacing
the squared error terms with a robust function $\rho(e_i)$ (Huber 1981; Hampel, Ronchetti,
Rousseeuw et al. 1986; Black and Anandan 1996; Stewart 1999) to obtain

$$E_{\mathrm{SRD}}(\mathbf{u}) = \sum_i \rho(I_1(\mathbf{x}_i + \mathbf{u}) - I_0(\mathbf{x}_i)) = \sum_i \rho(e_i). \tag{8.2}$$

The robust norm $\rho(e)$ is a function that grows less quickly than the quadratic penalty associated
with least squares. One such function, sometimes used in motion estimation for video
coding because of its speed, is the sum of absolute differences (SAD) metric 3 or $L_1$ norm,
i.e.,

$$E_{\mathrm{SAD}}(\mathbf{u}) = \sum_i |I_1(\mathbf{x}_i + \mathbf{u}) - I_0(\mathbf{x}_i)| = \sum_i |e_i|. \tag{8.3}$$

However, since this function is not differentiable at the origin, it is not well suited to gradient-
descent approaches such as the ones presented in Section 8.1.3.

1 The usual justification for using least squares is that it is the optimal estimate with respect to Gaussian noise.
See the discussion below on robust error metrics as well as Appendix B.3.

2 Brightness constancy (Horn 1974) is the tendency for objects to maintain their perceived brightness under
varying illumination conditions.

3 In video compression, e.g., the H.264 standard (http://www.itu.int/rec/T-REC-H.264), the sum of absolute trans-
formed differences (SATD), which measures the differences in a frequency transform space, e.g., using a Hadamard
transform, is often used since it more accurately predicts quality (Richardson 2003).

Instead, a smoothly varying function that is quadratic for small values but grows more
slowly away from the origin is often used. Black and Rangarajan (1996) discuss a variety of
such functions, including the Geman-McClure function,

$$\rho_{\mathrm{GM}}(x) = \frac{x^2}{1 + x^2/a^2}, \tag{8.4}$$

where $a$ is a constant that can be thought of as an outlier threshold. An appropriate value for
the threshold can itself be derived using robust statistics (Huber 1981; Hampel, Ronchetti,
Rousseeuw et al. 1986; Rousseeuw and Leroy 1987), e.g., by computing the median absolute
deviation, $\mathrm{MAD} = \mathrm{med}_i |e_i|$, and multiplying it by 1.4 to obtain a robust estimate of the
standard deviation of the inlier noise process (Stewart 1999).
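
The Geman-McClure penalty, x^2 / (1 + x^2 / a^2), and the MAD-based scale estimate are simple to express in NumPy. In the sketch below, the function names are hypothetical and the factor 1.4 follows the text; the more precise constant for Gaussian noise is about 1.4826.

```python
import numpy as np

def geman_mcclure(e, a):
    """Geman-McClure penalty: close to e^2 for |e| << a, saturating at a^2."""
    e2 = np.asarray(e, dtype=float) ** 2
    return e2 / (1.0 + e2 / a ** 2)

def robust_sigma(e, factor=1.4):
    """Robust estimate of the inlier noise standard deviation from the
    median absolute deviation of the residuals."""
    return factor * float(np.median(np.abs(e)))
```

A typical use is to compute the residuals at the current alignment, set `a` from `robust_sigma`, and then sum `geman_mcclure` over all pixels in place of the squared error.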

Spatially varying weights. The error metrics above ignore the fact that for a given alignment,
some of the pixels being compared may lie outside the original image boundaries.
Furthermore, we may want to partially or completely downweight the contributions of certain
pixels. For example, we may want to selectively “erase” some parts of an image from
consideration when stitching a mosaic where unwanted foreground objects have been cut out.
For applications such as background stabilization, we may want to downweight the middle
part of the image, which often contains independently moving objects being tracked by the
camera.

All of these tasks can be accomplished by associating a spatially varying per-pixel weight
value with each of the two images being matched. The error metric then becomes the
weighted (or windowed) SSD function,

$$E_{WSSD}(u) = \sum_i w_0(x_i)\, w_1(x_i + u)\, [I_1(x_i + u) - I_0(x_i)]^2, \tag{8.5}$$

where the weighting functions $w_0$ and $w_1$ are zero outside the image boundaries.

If a large range of potential motions is allowed, the above metric can have a bias towards
smaller overlap solutions. To counteract this bias, the windowed SSD score can be divided
by the overlap area

$$A = \sum_i w_0(x_i)\, w_1(x_i + u) \tag{8.6}$$

to compute a per-pixel (or mean) squared pixel error $E_{WSSD}/A$. The square root of this
quantity is the root mean square intensity error

$$RMS = \sqrt{E_{WSSD}/A} \tag{8.7}$$

often reported in comparative studies.
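
The windowed SSD, overlap area, and RMS error can be sketched in a few lines of NumPy for an integer shift. This is only an illustration: the function name `windowed_ssd` is hypothetical, both images are assumed to be 2D float arrays, and samples of the second image that fall outside its boundaries are simply given zero weight.

```python
import numpy as np

def windowed_ssd(I0, I1, w0, w1, u):
    """Return (E_WSSD, overlap area A, RMS) for an integer shift u = (dx, dy).

    I1 and w1 are sampled at x_i + u; samples outside I1 receive zero weight.
    """
    dx, dy = u
    H, W = I0.shape
    ys, xs = np.mgrid[0:H, 0:W]
    xs1, ys1 = xs + dx, ys + dy
    inside = (xs1 >= 0) & (xs1 < I1.shape[1]) & (ys1 >= 0) & (ys1 < I1.shape[0])
    xs1c = np.clip(xs1, 0, I1.shape[1] - 1)   # clipped indices are masked out below
    ys1c = np.clip(ys1, 0, I1.shape[0] - 1)
    w = w0 * np.where(inside, w1[ys1c, xs1c], 0.0)
    diff = I1[ys1c, xs1c] - I0
    e_wssd = float(np.sum(w * diff**2))       # (8.5)
    area = float(np.sum(w))                   # (8.6)
    rms = np.sqrt(e_wssd / area) if area > 0 else np.inf   # (8.7)
    return e_wssd, area, rms
```

With binary weights, `area` is just the number of overlapping pixels, so dividing by it removes the preference for shifts that leave only a few pixels in common.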
Bias and gain (exposure differences). Often, the two images being aligned were not taken
with the same exposure. A simple model of linear (affine) intensity variation between the two
images is the bias and gain model,

$$I_1(x + u) = (1 + \alpha) I_0(x) + \beta, \tag{8.8}$$

where $\beta$ is the bias and $\alpha$ is the gain (Lucas and Kanade 1981; Gennert 1988; Fuh and
Maragos 1991; Baker, Gross, and Matthews 2003; Evangelidis and Psarakis 2008). The least
squares formulation then becomes

$$E_{BG}(u) = \sum_i [I_1(x_i + u) - (1 + \alpha) I_0(x_i) - \beta]^2 = \sum_i [\alpha I_0(x_i) + \beta - e_i]^2. \tag{8.9}$$

Rather than taking a simple squared difference between corresponding patches, it becomes
necessary to perform a linear regression (Appendix A.2), which is somewhat more costly.
Note that for color images, it may be necessary to estimate a different bias and gain for each 
color channel to compensate for the automatic color correction performed by some digital 
cameras (Section 2.3.2). Bias and gain compensation is also used in video codecs, where it is 
known as weighted prediction (Richardson 2003). 
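
For a single candidate shift, the regression implied by the bias and gain model can be sketched as an ordinary least squares fit of one patch against the other. The helper name `bias_gain_error` is hypothetical, and the two patches are assumed to be already aligned and of equal size.

```python
import numpy as np

def bias_gain_error(patch0, patch1):
    """Fit patch1 ~ (1 + alpha) * patch0 + beta; return (alpha, beta, residual SSD)."""
    i0 = np.asarray(patch0, dtype=float).ravel()
    i1 = np.asarray(patch1, dtype=float).ravel()
    M = np.column_stack([i0, np.ones_like(i0)])      # regressors: I0 and a constant
    (gain, beta), *_ = np.linalg.lstsq(M, i1, rcond=None)
    resid = i1 - (gain * i0 + beta)
    return gain - 1.0, beta, float(resid @ resid)
```

The residual SSD returned here plays the role of the exposure-compensated matching cost; for color images the same fit could be repeated per channel.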

A more general (spatially varying, non-parametric) model of intensity variation, which is 
computed as part of the registration process, is used in (Negahdaripour 1998; Jia and Tang 
2003; Seitz and Baker 2009). This can be useful for dealing with local variations such as 
the vignetting caused by wide-angle lenses, wide apertures, or lens housings. It is also pos- 
sible to pre-process the images before comparing their values, e.g., using band-pass filtered 
images (Anandan 1989; Bergen, Anandan, Hanna et al. 1992), gradients (Scharstein 1994; 
Papenberg, Bruhn, Brox et al. 2006), or using other local transformations such as histograms 
or rank transforms (Cox, Roy, and Hingorani 1995; Zabih and Woodfill 1994), or to max- 
imize mutual information (Viola and Wells III 1997; Kim, Kolmogorov, and Zabih 2003). 
Hirschmüller and Scharstein (2009) compare a number of these approaches and report on
their relative performance in scenes with exposure differences. 

Correlation. An alternative to taking intensity differences is to perform correlation, i.e., to
maximize the product (or cross-correlation) of the two aligned images,

$$E_{CC}(u) = \sum_i I_0(x_i)\, I_1(x_i + u). \tag{8.10}$$

At first glance, this may appear to make bias and gain modeling unnecessary, since the images
will prefer to line up regardless of their relative scales and offsets. However, this is actually
not true. If a very bright patch exists in $I_1(x)$, the maximum product may actually lie in that
area.

For this reason, normalized cross-correlation is more commonly used,

$$E_{NCC}(u) = \frac{\sum_i [I_0(x_i) - \bar{I}_0]\, [I_1(x_i + u) - \bar{I}_1]}{\sqrt{\sum_i [I_0(x_i) - \bar{I}_0]^2}\, \sqrt{\sum_i [I_1(x_i + u) - \bar{I}_1]^2}}, \tag{8.11}$$

where

$$\bar{I}_0 = \frac{1}{N} \sum_i I_0(x_i) \quad \text{and} \tag{8.12}$$

$$\bar{I}_1 = \frac{1}{N} \sum_i I_1(x_i + u) \tag{8.13}$$

are the mean images of the corresponding patches and $N$ is the number of pixels in the patch.
The normalized cross-correlation score is always guaranteed to be in the range $[-1, 1]$, which
makes it easier to handle in some higher-level applications, such as deciding which patches
truly match. Normalized correlation works well when matching images taken with different
exposures, e.g., when creating high dynamic range images (Section 10.2). Note, however,
that the NCC score is undefined if either of the two patches has zero variance (and, in fact, its
performance degrades for noisy low-contrast regions).
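
A minimal NumPy sketch of the normalized cross-correlation score for two already-aligned patches is given below. The function name `ncc` and the convention of returning `None` when the score is undefined are choices made for this example only.

```python
import numpy as np

def ncc(patch0, patch1, eps=1e-12):
    """Normalized cross-correlation (8.11) of two equally sized patches.

    Returns None when either patch has (numerically) zero variance,
    since the score is undefined in that case.
    """
    a = np.asarray(patch0, dtype=float)
    b = np.asarray(patch1, dtype=float)
    a = a - a.mean()                        # subtract the patch means (8.12), (8.13)
    b = b - b.mean()
    denom = np.sqrt(np.sum(a**2) * np.sum(b**2))
    if denom < eps:
        return None
    return float(np.sum(a * b) / denom)
```

Because the means are subtracted and the result is divided by both standard deviations, any positive gain and any bias applied to one patch leave the score unchanged.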

A variant on NCC, which is related to the bias-gain regression implicit in the matching
score (8.9), is the normalized SSD score

$$E_{NSSD}(u) = \frac{1}{2}\, \frac{\sum_i \big[[I_0(x_i) - \bar{I}_0] - [I_1(x_i + u) - \bar{I}_1]\big]^2}{\sqrt{\sum_i [I_0(x_i) - \bar{I}_0]^2 + [I_1(x_i + u) - \bar{I}_1]^2}} \tag{8.14}$$

recently proposed by Criminisi, Shotton, Blake et al. (2007). In their experiments, they find
that it produces comparable results to NCC, but is more efficient when applied to a large
number of overlapping patches using a moving average technique (Section 3.2.2).

8.1.1 Hierarchical motion estimation

Now that we have a well-defined alignment cost function to optimize, how can we find its
minimum? The simplest solution is to do a full search over some range of shifts, using either
integer or sub-pixel steps. This is often the approach used for block matching in motion
compensated video compression, where a range of possible motions (say, ±16 pixels) is
explored. 4

To accelerate this search process, hierarchical motion estimation is often used: an image
pyramid (Section 3.5) is constructed and a search over a smaller number of discrete pixels
(corresponding to the same range of motion) is first performed at coarser levels (Quam 1984;
Anandan 1989; Bergen, Anandan, Hanna et al. 1992). The motion estimate from one level
of the pyramid is then used to initialize a smaller local search at the next finer level.
Alternatively, several seeds (good solutions) from the coarse level can be used to initialize the
fine-level search. While this is not guaranteed to produce the same result as a full search, it
usually works almost as well and is much faster.

4 In stereo matching (Section 11.1.2), an explicit search over all possible disparities (i.e., a plane sweep) is almost
always performed, since the number of search hypotheses is much smaller due to the 1D nature of the potential
displacements.

More formally, let

$$I_k^{(l)}(x_j) \leftarrow \tilde{I}_k^{(l-1)}(2 x_j) \tag{8.15}$$

be the decimated image at level $l$ obtained by subsampling (downsampling) a smoothed
version of the image at level $l - 1$. See Section 3.5 for how to perform the required downsampling
(pyramid construction) without introducing too much aliasing.

At the coarsest level, we search for the best displacement $u^{(l)}$ that minimizes the
difference between images $I_0^{(l)}$ and $I_1^{(l)}$. This is usually done using a full search over some
range of displacements $u^{(l)} \in 2^{-l}[-S, S]^2$, where $S$ is the desired search range at the finest
(original) resolution level, optionally followed by the incremental refinement step described
in Section 8.1.3.

Once a suitable motion vector has been estimated, it is used to predict a likely displacement

$$\hat{u}^{(l-1)} \leftarrow 2 u^{(l)} \tag{8.16}$$

for the next finer level. 5 The search over displacements is then repeated at the finer level over
a much narrower range of displacements, say $\hat{u}^{(l-1)} \pm 1$, again optionally combined with an
incremental refinement step (Anandan 1989). Alternatively, one of the images can be warped
(resampled) by the current motion estimate, in which case only small incremental motions
need to be computed at the finer level. A nice description of the whole process, extended to
parametric motion estimation (Section 8.2), is provided by Bergen, Anandan, Hanna et al.
(1992).
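
The coarse-to-fine procedure described above can be sketched as follows for integer translations. Several simplifications are assumptions of this example rather than part of the text: the pyramid uses a plain 2x2 box filter instead of the filters of Section 3.5, the matching cost is the mean squared difference over the overlap, both images have the same size, and all function names are hypothetical.

```python
import numpy as np

def downsample(img):
    """Halve the resolution with a 2x2 box filter (a crude stand-in for Section 3.5)."""
    H, W = (img.shape[0] // 2) * 2, (img.shape[1] // 2) * 2
    img = img[:H, :W]
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2] + img[0::2, 1::2] + img[1::2, 1::2])

def mean_ssd(I0, I1, dx, dy):
    """Mean squared difference between I0(x) and I1(x + u) over the overlap, u = (dx, dy)."""
    H, W = I0.shape
    x0, x1 = max(0, -dx), min(W, W - dx)
    y0, y1 = max(0, -dy), min(H, H - dy)
    if x1 <= x0 or y1 <= y0:
        return np.inf
    d = I1[y0 + dy:y1 + dy, x0 + dx:x1 + dx] - I0[y0:y1, x0:x1]
    return float(np.mean(d**2))

def search(I0, I1, center, radius):
    """Exhaustive search over integer shifts within +/- radius of center."""
    cx, cy = center
    cands = [(cx + i, cy + j) for j in range(-radius, radius + 1)
                              for i in range(-radius, radius + 1)]
    return min(cands, key=lambda u: mean_ssd(I0, I1, u[0], u[1]))

def coarse_to_fine(I0, I1, levels=3, S=16):
    """Full search at the coarsest level, then +/-1 refinement at each finer level."""
    pyr = [(I0, I1)]
    for _ in range(levels - 1):
        pyr.append((downsample(pyr[-1][0]), downsample(pyr[-1][1])))
    top = levels - 1
    u = search(*pyr[top], center=(0, 0), radius=max(1, S >> top))
    for l in range(top - 1, -1, -1):
        u = search(*pyr[l], center=(2 * u[0], 2 * u[1]), radius=1)   # prediction (8.16)
    return u
```

With three levels and $S = 16$, the coarsest level examines 81 candidate shifts and each finer level only 9, compared with 1089 for a full search at the original resolution.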

8.1.2 Fourier-based alignment 

When the search range corresponds to a significant fraction of the larger image (as is the case 
in image stitching, see Chapter 9), the hierarchical approach may not work that well, since 
it is often not possible to coarsen the representation too much before significant features are 
blurred away. In this case, a Fourier-based approach may be preferable. 

5 This doubling of displacements is only necessary if displacements are defined in integer pixel coordinates, 
which is the usual case in the literature (Bergen, Anandan, Hanna et al. 1992). If normalized device coordinates 
(Section 2.1.5) are used instead, the displacements (and search ranges) need not change from level to level, although 
the step sizes will need to be adjusted, to keep search steps of roughly one pixel. 
Fourier-based alignment relies on the fact that the Fourier transform of a shifted signal
has the same magnitude as the original signal but a linearly varying phase (Section 3.4), i.e.,

$$\mathcal{F}\{I_1(x + u)\} = \mathcal{F}\{I_1(x)\}\, e^{-j u \cdot \omega} = \mathcal{I}_1(\omega)\, e^{-j u \cdot \omega}, \tag{8.17}$$

where $\omega$ is the vector-valued angular frequency of the Fourier transform and we use
calligraphic notation $\mathcal{I}_1(\omega) = \mathcal{F}\{I_1(x)\}$ to denote the Fourier transform of a signal
(Section 3.4).

Another useful property of Fourier transforms is that convolution in the spatial domain
corresponds to multiplication in the Fourier domain (Section 3.4). 6 Thus, the Fourier
transform of the cross-correlation function $E_{CC}$ can be written as

$$\mathcal{F}\{E_{CC}(u)\} = \mathcal{F}\Big\{\sum_i I_0(x_i)\, I_1(x_i + u)\Big\} = \mathcal{F}\{I_0(u)\, \bar{*}\, I_1(u)\} = \mathcal{I}_0(\omega)\, \mathcal{I}_1^*(\omega), \tag{8.18}$$

where

$$f(u)\, \bar{*}\, g(u) = \sum_i f(x_i)\, g(x_i + u) \tag{8.19}$$

is the correlation function, i.e., the convolution of one signal with the reverse of the other,
and $\mathcal{I}_1^*(\omega)$ is the complex conjugate of $\mathcal{I}_1(\omega)$. This is because convolution is defined as the
summation of one signal with the reverse of the other (Section 3.4).

Thus, to efficiently evaluate $E_{CC}$ over the range of all possible values of $u$, we take the
Fourier transforms of both images $I_0(x)$ and $I_1(x)$, multiply both transforms together (after
conjugating the second one), and take the inverse transform of the result. The Fast Fourier
Transform algorithm can compute the transform of an $N \times M$ image in $O(NM \log NM)$
operations (Bracewell 1986). This can be significantly faster than the $O(N^2 M^2)$ operations
required to do a full search when the full range of image overlaps is considered.

While Fourier-based convolution is often used to accelerate the computation of image
correlations, it can also be used to accelerate the sum of squared differences function (and its
variants). Consider the SSD formula given in (8.1). Its Fourier transform can be written as

$$\mathcal{F}\{E_{SSD}(u)\} = \mathcal{F}\Big\{\sum_i [I_1(x_i + u) - I_0(x_i)]^2\Big\} = \delta(\omega) \sum_i [I_0^2(x_i) + I_1^2(x_i)] - 2\, \mathcal{I}_0(\omega)\, \mathcal{I}_1^*(\omega). \tag{8.20}$$

Thus, the SSD function can be computed by taking twice the correlation function and
subtracting it from the sum of the energies in the two images.
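
The following NumPy sketch evaluates the circular cross-correlation and the circular SSD for all shifts at once. It assumes two real-valued images of equal size and cyclic boundary handling; the function names are hypothetical. Under NumPy's FFT sign convention, the sum $\sum_i I_0(x_i) I_1(x_i + u)$ is obtained by conjugating the transform of $I_0$; conjugating the other transform instead simply mirrors the result ($u$ becomes $-u$).

```python
import numpy as np

def fft_cross_correlation(I0, I1):
    """Circular E_CC(u) = sum_x I0(x) I1(x + u) for every shift u.

    cc[dy, dx] holds the score for u = (dx, dy); negative shifts wrap
    around to the end of each axis.
    """
    F0 = np.fft.fft2(I0)
    F1 = np.fft.fft2(I1)
    return np.real(np.fft.ifft2(np.conj(F0) * F1))

def fft_ssd(I0, I1):
    """Circular E_SSD(u): sum of the two image energies minus twice the correlation."""
    energy = np.sum(I0**2) + np.sum(I1**2)
    return energy - 2.0 * fft_cross_correlation(I0, I1)
```

For a cyclically shifted copy, `np.unravel_index(np.argmax(cc), cc.shape)` returns the shift as `(dy, dx)`; with real images the wrap-around assumption is only reasonable for small shifts.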

6 In fact, the Fourier shift property (8.17) derives from the convolution theorem by observing that shifting is
equivalent to convolution with a displaced delta function $\delta(x - u)$.
Windowed correlation. Unfortunately, the Fourier convolution theorem only applies when
the summation over $x_i$ is performed over all the pixels in both images, using a circular shift
of the image when accessing pixels outside the original boundaries. While this is acceptable
for small shifts and comparably sized images, it makes no sense when the images overlap by
a small amount or one image is a small subset of the other.

In that case, the cross-correlation function should be replaced with a windowed (weighted)
cross-correlation function,

$$E_{WCC}(u) = \sum_i w_0(x_i) I_0(x_i)\; w_1(x_i + u) I_1(x_i + u) \tag{8.21}$$

$$= [w_0(x) I_0(x)]\, \bar{*}\, [w_1(x) I_1(x)], \tag{8.22}$$

where the weighting functions $w_0$ and $w_1$ are zero outside the valid ranges of the images
and both images are padded so that circular shifts return 0 values outside the original image
boundaries.

An even more interesting case is the computation of the weighted SSD function
introduced in Equation (8.5),

$$E_{WSSD}(u) = \sum_i w_0(x_i)\, w_1(x_i + u)\, [I_1(x_i + u) - I_0(x_i)]^2. \tag{8.23}$$

Expanding this as a sum of correlations and deriving the appropriate set of Fourier transforms
is left for Exercise 8.1.

The same kind of derivation can also be applied to the bias-gain corrected sum of squared
difference function $E_{BG}$ (8.9). Again, Fourier transforms can be used to efficiently compute
all the correlations needed to perform the linear regression in the bias and gain parameters in
order to estimate the exposure-compensated difference for each potential shift (Exercise 8.1).


Phase correlation. A variant of regular correlation (8.18) that is sometimes used for motion
estimation is phase correlation (Kuglin and Hines 1975; Brown 1992). Here, the spectrum of
the two signals being matched is whitened by dividing each per-frequency product in (8.18)
by the magnitudes of the Fourier transforms,

$$\mathcal{F}\{E_{PC}(u)\} = \frac{\mathcal{I}_0(\omega)\, \mathcal{I}_1^*(\omega)}{\|\mathcal{I}_0(\omega)\|\, \|\mathcal{I}_1(\omega)\|} \tag{8.24}$$

before taking the final inverse Fourier transform. In the case of noiseless signals with perfect
(cyclic) shift, we have $I_1(x + u) = I_0(x)$ and hence, from Equation (8.17), we obtain

$$\mathcal{F}\{I_1(x + u)\} = \mathcal{I}_1(\omega)\, e^{-2\pi j u \cdot \omega} = \mathcal{I}_0(\omega) \quad \text{and} \quad \mathcal{F}\{E_{PC}(u)\} = e^{-2\pi j u \cdot \omega}. \tag{8.25}$$

The output of phase correlation (under ideal conditions) is therefore a single spike (impulse)
located at the correct value of $u$, which (in principle) makes it easier to find the correct
estimate.
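
A compact NumPy sketch of phase correlation for two equally sized real images is shown below. The function name, the small `eps` guard against division by zero, and the unwrapping of large indices into negative shifts are choices made for this example. The conjugate is placed on the transform of $I_0$ so that, under NumPy's FFT convention, the spike lands at the $u$ satisfying $I_1(x + u) = I_0(x)$.

```python
import numpy as np

def phase_correlation(I0, I1, eps=1e-12):
    """Return ((dx, dy), surface) where surface is the whitened correlation."""
    F0 = np.fft.fft2(I0)
    F1 = np.fft.fft2(I1)
    cross = np.conj(F0) * F1
    cross /= np.maximum(np.abs(cross), eps)      # whitening: keep only the phase
    surface = np.real(np.fft.ifft2(cross))
    dy, dx = np.unravel_index(np.argmax(surface), surface.shape)
    H, W = I0.shape
    if dy > H // 2:                              # interpret large indices as negative shifts
        dy -= H
    if dx > W // 2:
        dx -= W
    return (int(dx), int(dy)), surface
```

For a noiseless cyclic shift the surface is a single impulse of height close to 1; how sharply the peak degrades with noise or non-cyclic overlap depends on the images, as discussed in the text.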

Phase correlation has a reputation in some quarters of outperforming regular correlation, 
but this behavior depends on the characteristics of the signals and noise. If the original images 
are contaminated by noise in a narrow frequency band (e.g., low-frequency noise or peaked 
frequency “hum”), the whitening process effectively de-emphasizes the noise in these regions. 
However, if the original signals have very low signal-to-noise ratio at some frequencies (say, 
two blurry or low-textured images with lots of high-frequency noise), the whitening process 
can actually decrease performance (see Exercise 8.1). 

Recently, gradient cross-correlation has emerged as a promising alternative to phase cor- 
relation (Argyriou and Vlachos 2003), although further systematic studies are probably war- 
ranted. Phase correlation has also been studied by Fleet and Jepson (1990) as a method for 
estimating general optical flow and stereo disparity. 

Rotations and scale. While Fourier-based alignment is mostly used to estimate translational
shifts between images, it can, under certain limited conditions, also be used to estimate
in-plane rotations and scales. Consider two images that are related purely by rotation, i.e.,

$$I_1(\hat{R} x) = I_0(x). \tag{8.26}$$

If we re-sample the images into polar coordinates,

$$\tilde{I}_0(r, \theta) = I_0(r \cos\theta, r \sin\theta) \quad \text{and} \quad \tilde{I}_1(r, \theta) = I_1(r \cos\theta, r \sin\theta), \tag{8.27}$$

we obtain

$$\tilde{I}_1(r, \theta + \hat{\theta}) = \tilde{I}_0(r, \theta). \tag{8.28}$$

The desired rotation can then be estimated using a Fast Fourier Transform (FFT) shift-based
technique.

If the two images are also related by a scale,

$$I_1(e^{\hat{s}} \hat{R} x) = I_0(x), \tag{8.29}$$

we can re-sample into log-polar coordinates,

$$\tilde{I}_0(s, \theta) = I_0(e^s \cos\theta, e^s \sin\theta) \quad \text{and} \quad \tilde{I}_1(s, \theta) = I_1(e^s \cos\theta, e^s \sin\theta), \tag{8.30}$$

to obtain

$$\tilde{I}_1(s + \hat{s}, \theta + \hat{\theta}) = \tilde{I}_0(s, \theta). \tag{8.31}$$

In this case, care must be taken to choose a suitable range of $s$ values that reasonably samples
the original image.

Figure 8.2 Taylor series approximation of a function and the incremental computation of
the optical flow correction amount. $J_1(x_i + u)$ is the image gradient at $(x_i + u)$ and $e_i$ is
the current intensity difference.

For images that are also translated by a small amount,

$$I_1(e^{\hat{s}} \hat{R} x + t) = I_0(x), \tag{8.32}$$

De Castro and Morandi (1987) propose an ingenious solution that uses several steps to esti- 
mate the unknown parameters. First, both images are converted to the Fourier domain and 
only the magnitudes of the transformed images are retained. In principle, the Fourier mag- 
nitude images are insensitive to translations in the image plane (although the usual caveats 
about border effects apply). Next, the two magnitude images are aligned in rotation and scale 
using the polar or log-polar representations. Once rotation and scale are estimated, one of the 
images can be de-rotated and scaled and a regular translational algorithm can be applied to 
estimate the translational shift. 

Unfortunately, this trick only applies when the images have large overlap (small transla- 
tional motion). For more general motion of patches or images, the parametric motion estima- 
tor described in Section 8.2 or the feature-based approaches described in Section 6.1 need to 
be used. 


8.1.3 Incremental refinement 

The techniques described up till now can estimate alignment to the nearest pixel (or poten- 
tially fractional pixel if smaller search steps are used). In general, image stabilization and 
stitching applications require much higher accuracies to obtain acceptable results. 

To obtain better sub-pixel estimates, we can use one of several techniques described by 
Tian and Huhns (1986). One possibility is to evaluate several discrete (integer or fractional) 
values of (u, v) around the best value found so far and to interpolate the matching score to 
find an analytic minimum. 
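
One simple way to interpolate the matching score is to fit a parabola through the three samples around the best integer shift along each axis. The sketch below does this separately in the horizontal and vertical directions, which is an assumption of this example (a full 2D quadratic fit is another option); the function names are hypothetical.

```python
import numpy as np

def parabolic_offset(s_minus, s_zero, s_plus):
    """Offset in (-0.5, 0.5) of the minimum of the parabola through three samples."""
    denom = s_minus - 2.0 * s_zero + s_plus
    if denom <= 0:            # not a strict local minimum: leave the estimate unchanged
        return 0.0
    return 0.5 * (s_minus - s_plus) / denom

def refine_minimum(score, u, v):
    """Refine an interior integer minimum (u, v) of score[v, u], one axis at a time."""
    du = parabolic_offset(score[v, u - 1], score[v, u], score[v, u + 1])
    dv = parabolic_offset(score[v - 1, u], score[v, u], score[v + 1, u])
    return u + du, v + dv

# Hypothetical score surface with its true minimum at (3.3, 5.8).
vs, us = np.mgrid[0:10, 0:10].astype(float)
score = (us - 3.3) ** 2 + 2.0 * (vs - 5.8) ** 2
u_sub, v_sub = refine_minimum(score, 3, 6)
```

For an exactly quadratic, axis-aligned score surface the fit recovers the true minimum; for real matching scores it is only an approximation.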
A more commonly used approach, first proposed by Lucas and Kanade (1981), is to
perform gradient descent on the SSD energy function (8.1), using a Taylor series expansion
of the image function (Figure 8.2),

$$E_{\text{LK-SSD}}(u + \Delta u) = \sum_i [I_1(x_i + u + \Delta u) - I_0(x_i)]^2 \tag{8.33}$$

$$\approx \sum_i [I_1(x_i + u) + J_1(x_i + u) \Delta u - I_0(x_i)]^2 \tag{8.34}$$

$$= \sum_i [J_1(x_i + u) \Delta u + e_i]^2, \tag{8.35}$$

where

$$J_1(x_i + u) = \nabla I_1(x_i + u) = \Big(\frac{\partial I_1}{\partial x}, \frac{\partial I_1}{\partial y}\Big)(x_i + u) \tag{8.36}$$

is the image gradient or Jacobian at $(x_i + u)$ and

$$e_i = I_1(x_i + u) - I_0(x_i), \tag{8.37}$$

first introduced in (8.1), is the current intensity error. 7 The gradient at a particular sub-pixel
location $(x_i + u)$ can be computed using a variety of techniques, the simplest of which is
to simply take the horizontal and vertical differences between pixels $x$ and $x + (1, 0)$ or
$x + (0, 1)$. More sophisticated derivatives can sometimes lead to noticeable performance
improvements.

The linearized form of the incremental update to the SSD error (8.35) is often called the 
optical flow constraint or brightness constancy constraint equation 


$$I_x u + I_y v + I_t = 0, \tag{8.38}$$

where the subscripts in $I_x$ and $I_y$ denote spatial derivatives, and $I_t$ is called the temporal
derivative, which makes sense if we are computing instantaneous velocity in a video
sequence. When squared and summed or integrated over a region, it can be used to compute
optic flow (Horn and Schunck 1981).

The above least squares problem (8.35) can be minimized by solving the associated normal
equations (Appendix A.2),

$$A \Delta u = b \tag{8.39}$$

where

$$A = \sum_i J_1^T(x_i + u)\, J_1(x_i + u) \tag{8.40}$$

and

$$b = -\sum_i e_i\, J_1^T(x_i + u) \tag{8.41}$$

are called the (Gauss-Newton approximation of the) Hessian and gradient-weighted residual
vector, respectively. 8 These matrices are also often written as

$$A = \begin{bmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{bmatrix} \quad \text{and} \quad b = -\begin{bmatrix} \sum I_x I_t \\ \sum I_y I_t \end{bmatrix}. \tag{8.42}$$

7 We follow the convention, commonly used in robotics and by Baker and Matthews (2004), that derivatives with
respect to (column) vectors result in row vectors, so that fewer transposes are needed in the formulas.


The gradients required for $J_1(x_i + u)$ can be evaluated at the same time as the image
warps required to estimate $I_1(x_i + u)$ (Section 3.6.1 (3.89)) and, in fact, are often computed
as a side-product of image interpolation. If efficiency is a concern, these gradients can be
replaced by the gradients in the template image,

$$J_1(x_i + u) \approx J_0(x_i), \tag{8.43}$$

since near the correct alignment, the template and displaced target images should look
similar. This has the advantage of allowing the pre-computation of the Hessian and Jacobian
images, which can result in significant computational savings (Hager and Belhumeur 1998;
Baker and Matthews 2004). A further reduction in computation can be obtained by writing
the warped image $I_1(x_i + u)$ used to compute $e_i$ in (8.37) as a convolution of a sub-pixel
interpolation filter with the discrete samples in $I_1$ (Peleg and Rav-Acha 2006). Precomputing
the inner product between the gradient field and shifted version of $I_1$ allows the iterative
re-computation of $e_i$ to be performed in constant time (independent of the number of pixels).

The effectiveness of the above incremental update rule relies on the quality of the Taylor
series approximation. When far away from the true displacement (say, 1-2 pixels), several
iterations may be needed. It is possible, however, to estimate a value for $J_1$ using a least
squares fit to a series of larger displacements in order to increase the range of convergence
(Jurie and Dhome 2002) or to “learn” a special-purpose recognizer for a given patch (Avidan
2001; Williams, Blake, and Cipolla 2003; Lepetit, Pilet, and Fua 2006; Hinterstoisser,
Benhimane, Navab et al. 2008; Ozuysal, Calonder, Lepetit et al. 2010) as discussed in
Section 4.1.4.

A commonly used stopping criterion for incremental updating is to monitor the magnitude
of the displacement correction $\|u\|$ and to stop when it drops below a certain threshold (say,
1/10 of a pixel). For larger motions, it is usual to combine the incremental update rule with a
hierarchical coarse-to-fine search strategy, as described in Section 8.1.1.
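
The translational update rule can be sketched in NumPy as a short iteration that builds the normal equations from (8.40) and (8.41) and stops when the correction becomes small. The bilinear sampler, the use of `np.gradient` for the image derivatives, the fixed border that keeps samples inside the image, and the function names are all assumptions of this illustration rather than choices made in the text.

```python
import numpy as np

def bilinear(img, xs, ys):
    """Sample img at real-valued coordinates (xs, ys), clamped to the image."""
    H, W = img.shape
    xs = np.clip(xs, 0, W - 1.000001)
    ys = np.clip(ys, 0, H - 1.000001)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    fx, fy = xs - x0, ys - y0
    top = (1 - fx) * img[y0, x0] + fx * img[y0, x0 + 1]
    bot = (1 - fx) * img[y0 + 1, x0] + fx * img[y0 + 1, x0 + 1]
    return (1 - fy) * top + fy * bot

def lucas_kanade_translation(I0, I1, u0=(0.0, 0.0), iters=20, tol=0.1, border=4):
    """Estimate u = (ux, uy) such that I1(x + u) ~ I0(x) by Gauss-Newton updates."""
    u = np.array(u0, dtype=float)
    gy, gx = np.gradient(I1)                        # simple central differences
    H, W = I0.shape
    ys, xs = np.mgrid[border:H - border, border:W - border].astype(float)
    template = I0[border:H - border, border:W - border]
    for _ in range(iters):
        e = bilinear(I1, xs + u[0], ys + u[1]) - template          # residuals (8.37)
        J = np.stack([bilinear(gx, xs + u[0], ys + u[1]).ravel(),
                      bilinear(gy, xs + u[0], ys + u[1]).ravel()], axis=1)
        A = J.T @ J                                                # (8.40)
        b = -J.T @ e.ravel()                                       # (8.41)
        du = np.linalg.solve(A, b)                                 # (8.39)
        u += du
        if np.linalg.norm(du) < tol:                               # stopping criterion
            break
    return u
```

The default `tol=0.1` mirrors the one-tenth of a pixel threshold mentioned above; `np.linalg.solve` will fail or return unreliable values when the patch lacks two-dimensional texture.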

8 The true Hessian is the full second derivative of the error function $E$, which may not be positive definite; see
Section 6.1.3 and Appendix A.3.
Figure 8.3 Aperture problems for different image regions, denoted by the orange and red
L-shaped structures, overlaid in the same image to make it easier to diagram the flow. (a) A
window $w(x_i)$ centered at $x_i$ (black circle) can uniquely be matched to its corresponding
structure at $x_i + u$ in the second (red) image. (b) A window centered on the edge exhibits the
classic aperture problem, since it can be matched to a 1D family of possible locations. (c) In
a completely textureless region, the matches become totally unconstrained.


Conditioning and aperture problems. Sometimes, the inversion of the linear system (8.39) 
can be poorly conditioned because of lack of two-dimensional texture in the patch being 
aligned. A commonly occurring example of this is the aperture problem , first identified in 
some of the early papers on optical flow (Horn and Schunck 1981) and then studied more ex- 
tensively by Anandan (1989). Consider an image patch that consists of a slanted edge moving 
to the right (Figure 8.3). Only the normal component of the velocity (displacement) can be 
reliably recovered in this case. This manifests itself in (8.39) as a rank-deficient matrix A, 
i.e., one whose smaller eigenvalue is very close to zero. 9 

When Equation (8.39) is solved, the component of the displacement along the edge is very 
poorly conditioned and can result in wild guesses under small noise perturbations. One way 
to mitigate this problem is to add a prior (soft constraint) on the expected range of motions 
(Simoncelli, Adelson, and Heeger 1991; Baker, Gross, and Matthews 2004; Govindu 2006). 
This can be accomplished by adding a small value to the diagonal of A, which essentially 
biases the solution towards smaller $\Delta u$ values that still (mostly) minimize the squared error.

However, the pure Gaussian model assumed when using a simple (fixed) quadratic prior, 
as in (Simoncelli, Adelson, and Heeger 1991), does not always hold in practice, e.g., because 
of aliasing along strong edges (Triggs 2004). For this reason, it may be prudent to add some 
small fraction (say, 5%) of the larger eigenvalue to the smaller one before doing the matrix 
inversion. 


9 The matrix $A$ is by construction always guaranteed to be symmetric positive semi-definite, i.e., it has real
non-negative eigenvalues.
Figure 8.4 SSD surfaces corresponding to three locations (red crosses) in an image:
(a) highly textured area, strong minimum, low uncertainty; (b) strong edge, aperture
problem, high uncertainty in one direction; (c) weak texture, no clear minimum, large uncertainty.

Uncertainty modeling. The reliability of a particular patch-based motion estimate can be
captured more formally with an uncertainty model. The simplest such model is a covariance
matrix, which captures the expected variance in the motion estimate in all possible directions.
As discussed in Section 6.1.4 and Appendix B.6, under small amounts of additive Gaussian
noise, it can be shown that the covariance matrix $\Sigma_u$ is proportional to the inverse of the
Hessian $A$,

$$\Sigma_u = \sigma_n^2 A^{-1}, \tag{8.44}$$

where $\sigma_n^2$ is the variance of the additive Gaussian noise (Anandan 1989; Matthies, Kanade,
and Szeliski 1989; Szeliski 1989).

For larger amounts of noise, the linearization performed by the Lucas-Kanade algorithm 
in (8.35) is only approximate, so the above quantity becomes a Cramer-Rao lower bound on 
the true covariance. Thus, the minimum and maximum eigenvalues of the Hessian A can now 
be interpreted as the (scaled) inverse variances in the least-certain and most-certain directions 
of motion. (A more detailed analysis using a more realistic model of image noise is given by 
Steele and Jaynes (2005).) Figure 8.4 shows the local SSD surfaces for three different pixel 
locations in an image. As you can see, the surface has a clear minimum in the highly textured 
region and suffers from the aperture problem near the strong edge. 

Bias and gain, weighting, and robust error metrics. The Lucas-Kanade update rule can
also be applied to the bias-gain equation (8.9) to obtain

$$E_{\text{LK-BG}}(u + \Delta u) = \sum_i [J_1(x_i + u) \Delta u + e_i - \alpha I_0(x_i) - \beta]^2 \tag{8.45}$$

(Lucas and Kanade 1981; Gennert 1988; Fuh and Maragos 1991; Baker, Gross, and Matthews
2003). The resulting $4 \times 4$ system of equations can be solved to simultaneously estimate the
translational displacement update $\Delta u$ and the bias and gain parameters $\beta$ and $\alpha$.

A similar formulation can be derived for images (templates) that have a linear appearance
variation,

$$I_1(x + u) = I_0(x) + \sum_j \lambda_j B_j(x), \tag{8.46}$$

where the $B_j(x)$ are the basis images and the $\lambda_j$ are the unknown coefficients (Hager and
Belhumeur 1998; Baker, Gross, Ishikawa et al. 2003; Baker, Gross, and Matthews 2003).
Potential linear appearance variations include illumination changes (Hager and Belhumeur
1998) and small non-rigid deformations (Black and Jepson 1998).

A weighted (windowed) version of the Lucas-Kanade algorithm is also possible:

$$E_{\text{LK-WSSD}}(u + \Delta u) = \sum_i w_0(x_i)\, w_1(x_i + u)\, [J_1(x_i + u) \Delta u + e_i]^2. \tag{8.47}$$

Note that here, in deriving the Lucas-Kanade update from the original weighted SSD function
(8.5), we have neglected taking the derivative of the $w_1(x_i + u)$ weighting function with
respect to $u$, which is usually acceptable in practice, especially if the weighting function is a
binary mask with relatively few transitions.

Baker, Gross, Ishikawa et al. (2003) only use the $w_0(x)$ term, which is reasonable if the
two images have the same extent and no (independent) cutouts in the overlap region. They
also discuss the idea of making the weighting proportional to $\nabla I(x)$, which helps for very
noisy images, where the gradient itself is noisy. Similar observations, formulated in terms
of total least squares (Van Huffel and Vandewalle 1991; Van Huffel and Lemmerling 2002),
have been made by other researchers studying optical flow (Weber and Malik 1995;
Bab-Hadiashar and Suter 1998b; Mühlich and Mester 1998). Lastly, Baker, Gross, Ishikawa et al.
(2003) show how evaluating Equation (8.47) at just the most reliable (highest gradient) pixels
does not significantly reduce performance for large enough images, even if only 5-10% of
the pixels are used. (This idea was originally proposed by Dellaert and Collins (1999), who
used a more sophisticated selection criterion.)

The Lucas-Kanade incremental refinement step can also be applied to the robust error
metric introduced in Section 8.1,

$$E_{\text{LK-SRD}}(u + \Delta u) = \sum_i \rho(J_1(x_i + u) \Delta u + e_i), \tag{8.48}$$

which can be solved using the iteratively reweighted least squares technique described in
Section 6.1.4.


8.2 Parametric motion 

Many image alignment tasks, for example image stitching with handheld cameras, require 
the use of more sophisticated motion models, as described in Section 2.1.2. Since these 
models, e.g., affine deformations, typically have more parameters than pure translation, a 
full search over the possible range of values is impractical. Instead, the incremental Lucas- 
Kanade algorithm can be generalized to parametric motion models and used in conjunction 
with a hierarchical search algorithm (Lucas and Kanade 1981; Rehg and Witkin 1991; Fuh 
and Maragos 1991; Bergen, Anandan, Hanna et al. 1992; Shashua and Toelg 1997; Shashua 
and Wexler 2001; Baker and Matthews 2004). 

For parametric motion, instead of using a single constant translation vector u, we use
a spatially varying motion field or correspondence map, x'(x;p), parameterized by a low- 
dimensional vector p, where x' can be any of the motion models presented in Section 2.1.2. 



The parametric incremental motion update rule now becomes

$$E_{\text{LK-PM}}(p + \Delta p) = \sum_i [I_1(x'(x_i; p + \Delta p)) - I_0(x_i)]^2 \tag{8.49}$$

$$\approx \sum_i [I_1(x'_i) + J_1(x'_i) \Delta p - I_0(x_i)]^2 \tag{8.50}$$

$$= \sum_i [J_1(x'_i) \Delta p + e_i]^2, \tag{8.51}$$

where the Jacobian is now

$$J_1(x'_i) = \frac{\partial I_1}{\partial p} = \nabla I_1(x'_i)\, \frac{\partial x'}{\partial p}(x_i), \tag{8.52}$$

i.e., the product of the image gradient $\nabla I_1$ with the Jacobian of the correspondence field,
$J_{x'} = \partial x' / \partial p$.

The motion Jacobians $J_{x'}$ for the 2D planar transformations introduced in Section 2.1.2
and Table 2.1 are given in Table 6.1. Note how we have re-parameterized the motion matrices
so that they are always the identity at the origin $p = 0$. This becomes useful later, when we
talk about the compositional and inverse compositional algorithms. (It also makes it easier to
impose priors on the motions.)

For parametric motion, the (Gauss-Newton) Hessian and gradient-weighted residual vector
become

$$A = \sum_i J_{x'}^T(x_i)\, [\nabla I_1^T(x'_i)\, \nabla I_1(x'_i)]\, J_{x'}(x_i) \tag{8.53}$$

and

$$b = -\sum_i J_{x'}^T(x_i)\, [e_i\, \nabla I_1^T(x'_i)]. \tag{8.54}$$

Note how the expressions inside the square brackets are the same ones evaluated for the
simpler translational motion case (8.40-8.41).
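
To make the structure of the parametric Hessian and residual concrete, the sketch below accumulates them for an affine motion model. The parameter ordering $p = (t_x, t_y, a_{00}, a_{01}, a_{10}, a_{11})$ and the corresponding Jacobian are assumptions of this example, as are the function names; the per-pixel gradients and residuals are taken as given inputs.

```python
import numpy as np

def affine_jacobian(x, y):
    """d x'/d p for x' = [[1 + a00, a01, tx], [a10, 1 + a11, ty]] @ [x, y, 1],
    with the assumed ordering p = (tx, ty, a00, a01, a10, a11)."""
    return np.array([[1.0, 0.0, x, y, 0.0, 0.0],
                     [0.0, 1.0, 0.0, 0.0, x, y]])

def accumulate_normal_equations(coords, grads, errors):
    """Accumulate A and b from pixel coordinates, image gradients at x'_i, and residuals e_i."""
    A = np.zeros((6, 6))
    b = np.zeros(6)
    for (x, y), g, e in zip(coords, grads, errors):
        Jx = affine_jacobian(x, y)                    # 2 x 6 motion Jacobian
        g = np.asarray(g, dtype=float).reshape(1, 2)  # row-vector image gradient
        A += Jx.T @ (g.T @ g) @ Jx                    # bracketed 2 x 2 term, then the outer Jacobians
        b -= (Jx.T @ (e * g.T)).ravel()               # bracketed 2-vector term
    return A, b
```

Solving `np.linalg.solve(A, b)` then gives the parameter update $\Delta p$; the per-pixel loop is written for clarity and would normally be vectorized.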


Patch-based approximation. The computation of the Hessian and residual vectors for
parametric motion can be significantly more expensive than for the translational case. For
parametric motion with $n$ parameters and $N$ pixels, the accumulation of $A$ and $b$ takes
$O(n^2 N)$ operations (Baker and Matthews 2004). One way to reduce this by a significant
amount is to divide the image up into smaller sub-blocks (patches) $P_j$ and to only accumulate
the simpler $2 \times 2$ quantities inside the square brackets at the pixel level (Shum and Szeliski
2000),

$$A_j = \sum_{i \in P_j} \nabla I_1^T(x'_i)\, \nabla I_1(x'_i) \tag{8.55}$$

$$b_j = \sum_{i \in P_j} e_i\, \nabla I_1^T(x'_i). \tag{8.56}$$

The full Hessian and residual can then be approximated as

$$A \approx \sum_j J_{x'}^T(\hat{x}_j) \Big[\sum_{i \in P_j} \nabla I_1^T(x'_i)\, \nabla I_1(x'_i)\Big] J_{x'}(\hat{x}_j) = \sum_j J_{x'}^T(\hat{x}_j)\, A_j\, J_{x'}(\hat{x}_j) \tag{8.57}$$

and

$$b \approx -\sum_j J_{x'}^T(\hat{x}_j) \Big[\sum_{i \in P_j} e_i\, \nabla I_1^T(x'_i)\Big] = -\sum_j J_{x'}^T(\hat{x}_j)\, b_j, \tag{8.58}$$

where $\hat{x}_j$ is the center of each patch $P_j$ (Shum and Szeliski 2000). This is equivalent to
replacing the true motion Jacobian with a piecewise-constant approximation. In practice,
this works quite well. The relationship of this approximation to feature-based registration is
discussed in Section 9.2.4.


Compositional approach. For a complex parametric motion such as a homography, the 
computation of the motion Jacobian becomes complicated and may involve a per-pixel divi- 
sion. Szeliski and Shum (1997) observed that this can be simplified by first warping the target 
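image $I_1$ according to the current motion estimate $x'(x; p)$,

$$\tilde{I}_1(x) = I_1(x'(x; p)), \tag{8.59}$$

and then comparing this warped image against the template $I_0(x)$,

$$E_{\text{LK-SS}}(\Delta p) = \sum_i [\tilde{I}_1(\tilde{x}(x_i; \Delta p)) - I_0(x_i)]^2 \tag{8.60}$$

$$\approx \sum_i [\tilde{J}_1(x_i) \Delta p + e_i]^2 \tag{8.61}$$

$$= \sum_i [\nabla \tilde{I}_1(x_i) J_{\tilde{x}}(x_i) \Delta p + e_i]^2. \tag{8.62}$$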


Note that since the two images are assumed to be fairly similar, only an incremental para- 
metric motion is required, i.e., the incremental motion can be evaluated around p = 0, which 
can lead to considerable simplifications. For example, the Jacobian of the planar projective 
transform (6.19) now becomes 


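$$J_{\tilde{x}} = \left. \frac{\partial \tilde{x}}{\partial p} \right|_{p=0} = \begin{bmatrix} x & y & 1 & 0 & 0 & 0 & -x^2 & -xy \\ 0 & 0 & 0 & x & y & 1 & -xy & -y^2 \end{bmatrix}. \tag{8.63}$$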


Once the incremental motion $\tilde{x}$ has been computed, it can be prepended to the previously
estimated motion, which is easy to do for motions represented with transformation matrices,
such as those given in Tables 2.1 and 6.1. Baker and Matthews (2004) call this the forward
compositional algorithm, since the target image is being re-warped and the final motion
estimates are being composed.
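
The following sketch spells out the Jacobian (8.63) and the composition step for a homography. It assumes the parameterization $H = I + [h_{00}, h_{01}, h_{02}; h_{10}, h_{11}, h_{12}; h_{20}, h_{21}, 0]$, so that $p = 0$ is the identity, and it reads "prepending" as applying the incremental warp before the current one. The function names are illustrative only.

```python
import numpy as np

def homography_from_params(p):
    # Homography that reduces to the identity at p = 0 (assumed parameterization).
    h00, h01, h02, h10, h11, h12, h20, h21 = p
    return np.array([[1.0 + h00, h01, h02],
                     [h10, 1.0 + h11, h12],
                     [h20, h21, 1.0]])

def apply_homography(H, x, y):
    # Map a pixel through H, including the perspective division.
    v = H @ np.array([x, y, 1.0])
    return v[:2] / v[2]

def jacobian_at_identity(x, y):
    # Equation (8.63): derivative of the warped position w.r.t. p at p = 0.
    return np.array([[x, y, 1, 0, 0, 0, -x * x, -x * y],
                     [0, 0, 0, x, y, 1, -x * y, -y * y]], dtype=float)

def compose(H_current, dp):
    # Forward compositional update: the increment acts on x first, then the
    # current estimate, i.e., x' = H_current * H_increment * x.
    H = H_current @ homography_from_params(dp)
    return H / H[2, 2]   # fix the overall scale (assumes H[2, 2] != 0)
```

Because the Jacobian is evaluated at the identity, it contains no per-pixel division, which is the simplification the compositional approach is after.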

If the appearance of the warped and template images is similar enough, we can replace
the gradient of $\tilde{I}_1(x)$ with the gradient of $I_0(x)$, as suggested previously (8.43). This has
potentially a big advantage in that it allows the pre-computation (and inversion) of the Hessian
matrix A given in Equation (8.53). The residual vector b (8.54) can also be partially
precomputed, i.e., the steepest descent images $\nabla I_0(x) J_{\tilde{x}}(x)$ can be precomputed and stored for later
multiplication with the $e(x) = \tilde{I}_1(x) - I_0(x)$ error images (Baker and Matthews 2004). This
idea was first suggested by Hager and Belhumeur (1998) in what Baker and Matthews (2004)
call an inverse additive scheme.

Baker and Matthews (2004) introduce one more variant they call the inverse compositional
algorithm. Rather than (conceptually) re-warping the warped target image $\tilde{I}_1(x)$, they
instead warp the template image $I_0(x)$ and minimize

$$E_{\text{LK-BM}}(\Delta p) = \sum_i [\tilde{I}_1(x_i) - I_0(\tilde{x}(x_i; \Delta p))]^2 \tag{8.64}$$

$$\approx \sum_i [\nabla I_0(x_i) J_{\tilde{x}}(x_i) \Delta p - e_i]^2. \tag{8.65}$$

This is identical to the forward warped algorithm (8.62) with the gradients $\nabla \tilde{I}_1(x)$ replaced
by the gradients $\nabla I_0(x)$, except for the sign of $e_i$. The resulting update $\Delta p$ is the negative of
the one computed by the modified Equation (8.62) and hence the inverse of the incremental
transformation must be prepended to the current transform. Because the inverse compositional
algorithm has the potential of pre-computing the inverse Hessian and the steepest
descent images, this makes it the preferred approach of those surveyed by Baker and Matthews
(2004). Figure 8.5 (Baker, Gross, Ishikawa et al. 2003) beautifully shows all of the steps
required to implement the inverse compositional algorithm.
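
To make the structure of the iteration concrete, here is a minimal sketch of the inverse compositional update for the simplest possible warp, a pure translation. It is not Baker and Matthews' implementation: the images are assumed to be given as analytic functions of continuous coordinates so that warping needs no interpolation, and all names are hypothetical.

```python
import numpy as np

def inverse_compositional_translation(I0, grad_I0, I1, xs, ys, iters=10):
    """Estimate p such that I1(x + p) matches I0(x) for a pure translation.

    I0 and I1 are callables (x, y) -> intensity and grad_I0 returns the
    template gradient; both are assumptions made to keep the sketch short.
    """
    gx, gy = grad_I0(xs, ys)
    # Steepest descent images; for a translation the warp Jacobian is the identity.
    sd = np.stack([gx.ravel(), gy.ravel()], axis=1)
    A_inv = np.linalg.inv(sd.T @ sd)            # inverse Hessian, computed once
    template = I0(xs, ys).ravel()
    p = np.zeros(2)
    for _ in range(iters):
        warped = I1(xs + p[0], ys + p[1]).ravel()   # warp the target with p
        e = warped - template                       # error image
        dp = A_inv @ (sd.T @ e)                     # minimizer of (8.65)
        p = p - dp                                  # compose with the inverse increment
    return p

shift = np.array([1.5, -0.8])
bowl = lambda x, y: 0.01 * (x ** 2 + y ** 2)        # smooth synthetic template
bowl_grad = lambda x, y: (0.02 * x, 0.02 * y)
target = lambda x, y: bowl(x - shift[0], y - shift[1])
ys, xs = np.mgrid[0:32, 0:32].astype(float)
p_est = inverse_compositional_translation(bowl, bowl_grad, target, xs, ys)
```

Only the warp, the image difference, and one small matrix-vector product are repeated per iteration. For a richer motion model, the subtraction `p - dp` would be replaced by composing the current transform with the inverse of the incremental one.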

Baker and Matthews (2004) also discuss the advantage of using Gauss-Newton iteration 
(i.e., the first-order expansion of the least squares, as above) compared to other approaches 
such as steepest descent and Levenberg-Marquardt. Subsequent parts of the series (Baker, 
Gross, Ishikawa et al. 2003; Baker, Gross, and Matthews 2003, 2004) discuss more advanced 
topics such as per-pixel weighting, pixel selection for efficiency, a more in-depth discussion of 
robust metrics and algorithms, linear appearance variations, and priors on parameters. They 
make for invaluable reading for anyone interested in implementing a highly tuned imple- 
mentation of incremental image registration. Evangelidis and Psarakis (2008) provide some 
detailed experimental evaluations of these and other related approaches. 

Figure 8.5 A schematic overview of the inverse compositional algorithm (copied, with
permission, from (Baker, Gross, Ishikawa et al. 2003)). Steps 3-6 (light-colored arrows) are
performed once as a pre-computation. The main algorithm simply consists of iterating: image
warping (Step 1), image differencing (Step 2), image dot products (Step 7), multiplication
with the inverse of the Hessian (Step 8), and the update to the warp (Step 9). All of these
steps can be performed efficiently.

8.2.1 Application: Video stabilization

Video stabilization is one of the most widely used applications of parametric motion
estimation (Hansen, Anandan, Dana et al. 1994; Irani, Rousso, and Peleg 1997; Morimoto and
Chellappa 1997; Srinivasan, Chellappa, Veeraraghavan et al. 2005). Algorithms for
stabilization run inside both hardware devices, such as camcorders and still cameras, and software
packages for improving the visual quality of shaky videos.

In their paper on full-frame video stabilization, Matsushita, Ofek, Ge et al. (2006) give 
a nice overview of the three major stages of stabilization, namely motion estimation, motion 
smoothing, and image warping. Motion estimation algorithms often use a similarity trans- 
form to handle camera translations, rotations, and zooming. The tricky part is getting these 
algorithms to lock onto the background motion, which is a result of the camera movement, 
without getting distracted by independent moving foreground objects. Motion smoothing al- 
gorithms recover the low-frequency (slowly varying) part of the motion and then estimate 
the high-frequency shake component that needs to be removed. Finally, image warping algo- 
rithms apply the high-frequency correction to render the original frames as if the camera had 
undergone only the smooth motion. 
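
The motion smoothing stage can be sketched in a few lines. The example below assumes that the per-frame motion has already been accumulated into a trajectory of additive parameters (e.g., translation, rotation angle, and log-scale) and uses a plain moving average as the low-pass filter. Both choices are illustrative assumptions rather than the method of Matsushita, Ofek, Ge et al. (2006).

```python
import numpy as np

def smooth_trajectory(params, radius):
    # params: (T, k) array of cumulative motion parameters, one row per frame.
    # Returns the low-frequency part using a (2 * radius + 1)-frame moving average.
    padded = np.pad(params, ((radius, radius), (0, 0)), mode="edge")
    kernel = np.ones(2 * radius + 1) / (2 * radius + 1)
    cols = [np.convolve(padded[:, c], kernel, mode="valid")
            for c in range(params.shape[1])]
    return np.stack(cols, axis=1)

def stabilizing_corrections(params, radius=5):
    # Per-frame correction that moves each frame from its measured pose to
    # the smoothed pose, i.e., it cancels the high-frequency shake.
    return smooth_trajectory(params, radius) - params
```

Each correction would then be converted back into a warp and applied to the corresponding frame in the image warping stage.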

The resulting stabilization algorithms can greatly improve the appearance of shaky videos 
but they often still contain visual artifacts. For example, image warping can result in missing 
borders around the image, which must be cropped, filled using information from other frames, 
or hallucinated using inpainting techniques (Section 10.5.1). Furthermore, video frames cap- 
tured during fast motion are often blurry. Their appearance can be improved either using 
deblurring techniques (Section 10.3) or stealing sharper pixels from other frames with less 
motion or better focus (Matsushita, Ofek, Ge et al. 2006). Exercise 8.3 has you implement 
and test some of these ideas. 

In situations where the camera is translating a lot in 3D, e.g., when the videographer is 
walking, an even better approach is to compute a full structure from motion reconstruction 
of the camera motion and 3D scene. A smooth 3D camera path can then be computed and 
the original video re-rendered using view interpolation with the interpolated 3D point cloud 
serving as the proxy geometry while preserving salient features (Liu, Gleicher, Jin et al. 
2009). If you have access to a camera array instead of a single video camera, you can do even 
better using a light field rendering approach (Section 13.3) (Smith, Zhang, Jin et al. 2009). 


8.2.2 Learned motion models 

An alternative to parameterizing the motion field with a geometric deformation such as an
affine transform is to learn a set of basis functions tailored to a particular application (Black,
Yacoob, Jepson et al. 1997). First, a set of dense motion fields (Section 8.4) is computed from
a set of training videos. Next, singular value decomposition (SVD) is applied to the stack of
motion fields $u_t(x)$ to compute the first few singular vectors $v_k(x)$. Finally, for a new test
sequence, a novel flow field is computed using a coarse-to-fine algorithm that estimates the
unknown coefficient $a_k$ in the parameterized flow field

$$u(x) = \sum_k a_k v_k(x). \tag{8.66}$$

Figure 8.6 Learned parameterized motion fields for a walking sequence (Black, Yacoob,
Jepson et al. 1997) © 1997 IEEE: (a) learned basis flow fields; (b) plots of motion coefficients
over time and corresponding estimated motion fields.

Figure 8.6a shows a set of basis fields learned by observing videos of walking motions. 
Figure 8.6b shows the temporal evolution of the basis coefficients as well as a few of the 
recovered parametric motion fields. Note that similar ideas can also be applied to feature 
tracks (Torresani, Hertzmann, and Bregler 2008), which is a topic we discuss in more detail 
in Sections 4.1.4 and 12.6.4. 
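
The learning step and the expansion (8.66) can be sketched as follows. The example assumes the training flows are stored as a (T, H, W, 2) array, does not subtract a mean field (the text does not say whether one should), and recovers the coefficients by projecting a known flow onto the basis. The coarse-to-fine estimation of the coefficients directly from images is not shown, and the function names are hypothetical.

```python
import numpy as np

def learn_flow_basis(flows, k):
    # flows: (T, H, W, 2) training motion fields u_t(x).
    # Returns the first k right singular vectors v_k(x) as rows of a (k, H*W*2) array.
    M = flows.reshape(len(flows), -1)
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[:k]

def project_flow(flow, basis):
    # Coefficients a_k of a flow field in the learned basis, and the
    # reconstruction u(x) = sum_k a_k v_k(x) from (8.66).
    a = basis @ flow.ravel()
    return a, (a @ basis).reshape(flow.shape)
```

Since the basis rows are orthonormal, the projection is a single matrix-vector product, and a flow that lies in the span of the training fields is reproduced exactly.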


8.3 Spline-based motion 

While parametric motion models are useful in a wide variety of applications (such as video 
stabilization and mapping onto planar surfaces), most image motion is too complicated to be 
captured by such low-dimensional models. 

Traditionally, optical flow algorithms (Section 8.4) compute an independent motion esti- 
mate for each pixel, i.e., the number of flow vectors computed is equal to the number of input 
pixels. The general optical flow analog to Equation (8.1) can thus be written as 

$$E_{\text{SSD-OF}}(\{u_i\}) = \sum_i [I_1(x_i + u_i) - I_0(x_i)]^2. \tag{8.67}$$

Notice how in the above equation, the number of variables $\{u_i\}$ is twice the number of
measurements, so the problem is underconstrained.

Figure 8.7 Spline motion field: the displacement vectors $u_i = (u_i, v_i)$ are shown as pluses
(+) and are controlled by the smaller number of control vertices $\hat{u}_j = (\hat{u}_j, \hat{v}_j)$, which are
shown as circles (o).

The two classic approaches to this problem, which we study in Section 8.4, are to perform 
the summation over overlapping regions (the patch-based or window-based approach) or to 
add smoothness terms on the $\{u_i\}$ field using regularization or Markov random fields
(Section 3.7). In this section, we describe an alternative approach that lies somewhere between
general optical flow (independent flow at each pixel) and parametric flow (a small number of
global parameters). The approach is to represent the motion field as a two-dimensional spline
controlled by a smaller number of control vertices $\{\hat{u}_j\}$ (Figure 8.7),

$$u_i = \sum_j \hat{u}_j B_j(x_i) = \sum_j \hat{u}_j w_{ij}, \tag{8.68}$$

where the $B_j(x_i)$ are called the basis functions and are only non-zero over a small finite
support interval (Szeliski and Coughlan 1997). We call the $w_{ij} = B_j(x_i)$ weights to emphasize
that the $\{u_i\}$ are known linear combinations of the $\{\hat{u}_j\}$. Some commonly used spline basis
functions are shown in Figure 8.8.

Substituting the formula for the individual per-pixel flow vectors $u_i$ (8.68) into the SSD
error metric (8.67) yields a parametric motion formula similar to Equation (8.50). The biggest
difference is that the Jacobian $J_1(x'_i)$ (8.52) now consists of the sparse entries in the weight
matrix $W = [w_{ij}]$.
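
As a small illustration of (8.68), the sketch below expands a grid of control-vertex displacements into a dense flow field using bilinear basis functions, one of several possible choices of $B_j$. The regular grid layout, the `spacing` parameter, and the function name are assumptions of this example.

```python
import numpy as np

def spline_flow(control, spacing, height, width):
    # control: (Gy, Gx, 2) displacements at control vertices placed every
    # `spacing` pixels; the grid is assumed to cover the whole image.
    # Returns the dense (height, width, 2) flow u_i = sum_j u_hat_j w_ij.
    ys, xs = np.mgrid[0:height, 0:width]
    cx, cy = xs / spacing, ys / spacing                 # position in grid units
    j0 = np.minimum(cx.astype(int), control.shape[1] - 2)
    i0 = np.minimum(cy.astype(int), control.shape[0] - 2)
    fx = (cx - j0)[..., None]                           # fractional offsets
    fy = (cy - i0)[..., None]
    # Each pixel has only four non-zero weights w_ij, so W is sparse.
    return ((1 - fy) * (1 - fx) * control[i0, j0]
            + (1 - fy) * fx * control[i0, j0 + 1]
            + fy * (1 - fx) * control[i0 + 1, j0]
            + fy * fx * control[i0 + 1, j0 + 1])
```

The four products of `fx` and `fy` are exactly the weights $w_{ij}$ for a bilinear basis; higher-order bases such as quadratic B-splines would simply involve more neighboring vertices per pixel.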

In situations where we know something more about the motion field, e.g., when the
motion is due to a camera moving in a static scene, we can use more specialized motion models.
For example, the plane plus parallax model (Section 2.1.5) can be naturally combined with
a spline-based motion representation, where the in-plane motion is represented by a
homography (6.19) and the out-of-plane parallax d is represented by a scalar variable at each spline
control point (Szeliski and Kang 1995; Szeliski and Coughlan 1997).

Computer Vision: Algorithms and Applications (September 3, 2010 draft)

Figure 8.8 Sample spline basis functions (Szeliski and Coughlan 1997) © 1997 Springer.
The block (constant) interpolator/basis corresponds to block-based motion estimation
(Le Gall 1991). See Section 3.5.1 for more details on spline functions.

Figure 8.9 Quadtree spline-based motion estimation (Szeliski and Shum 1996) © 1996
IEEE: (a) quadtree spline representation, (b) which can lead to cracks, unless the white nodes
are constrained to depend on their parents; (c) deformed quadtree spline mesh overlaid on
grayscale image; (d) flow field visualized as a needle diagram.

In many cases, the small number of spline vertices results in a motion estimation problem 
that is well conditioned. However, if large textureless regions (or elongated edges subject 
to the aperture problem) persist across several spline patches, it may be necessary to add a 
regularization term to make the problem well posed (Section 3.7.1). The simplest way to 
do this is to directly add squared difference penalties between adjacent vertices in the spline 
control mesh $\{\hat{u}_j\}$, as in (3.100). If a multi-resolution (coarse-to-fine) strategy is being used,
it is important to re-scale these smoothness terms while going from level to level. 

The linear system corresponding to the spline-based motion estimator is sparse and regu- 
lar. Because it is usually of moderate size, it can often be solved using direct techniques such 
as Cholesky decomposition (Appendix A.4). Alternatively, if the problem becomes too large 
and subject to excessive fill-in, iterative techniques such as hierarchically preconditioned
conjugate gradient (Szeliski 1990b, 2006b) can be used instead (Appendix A.5).

Because of its robustness, spline-based motion estimation has been used for a number 
of applications, including visual effects (Roble 1999) and medical image registration (Sec- 
tion 8.3.1) (Szeliski and Lavallee 1996; Kybic and Unser 2003). 

One disadvantage of the basic technique, however, is that the model does a poor job 
near motion discontinuities, unless an excessive number of nodes is used. To remedy this 
situation, Szeliski and Shum (1996) propose using a quadtree representation embedded in the 
spline control grid (Figure 8.9a). Large cells are used to represent regions of smooth motion,
while smaller cells are added in regions of motion discontinuities (Figure 8.9c). 

To estimate the motion, a coarse-to-fine strategy is used. Starting with a regular spline 
imposed over a lower-resolution image, an initial motion estimate is obtained. Spline patches 
where the motion is inconsistent, i.e., the squared residual (8.67) is above a threshold, are 
subdivided into smaller patches. In order to avoid cracks in the resulting motion field
(Figure 8.9b), the values of certain nodes in the refined mesh, i.e., those adjacent to larger cells,
need to be restricted so that they depend on their parent values. This is most easily
accomplished using a hierarchical basis representation for the quadtree spline (Szeliski 1990b) and
selectively setting some of the hierarchical basis functions to 0, as described in (Szeliski and
Shum 1996).

Figure 8.10 Elastic brain registration (Kybic and Unser 2003) © 2003 IEEE: (a) original
brain atlas and patient MRI images overlaid in red-green; (b) after elastic registration with
eight user-specified landmarks (not shown); (c) a cubic B-spline deformation field, shown as
a deformed grid.

8.3.1 Application: Medical image registration

Because they excel at representing smooth elastic deformation fields, spline-based motion
models have found widespread use in medical image registration (Bajcsy and Kovacic 1989;
Szeliski and Lavallee 1996; Christensen, Joshi, and Miller 1997).¹⁰ Registration techniques
can be used both to track an individual patient’s development or progress over time (a
longitudinal study) and to match different patient images together to find commonalities and
detect variations or pathologies (cross-sectional studies). When different imaging modalities
are being registered, e.g., computed tomography (CT) scans and magnetic resonance images
(MRI), mutual information measures of similarity are often necessary (Viola and Wells III
1997; Maes, Collignon, Vandermeulen et al. 1997).
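
To see why mutual information suits multi-modal data, it helps to look at how it is computed. The sketch below uses a simple joint-histogram estimate with a fixed number of bins. This is one common estimator, not necessarily the one used in the cited papers, and the function name and bin count are assumptions.

```python
import numpy as np

def mutual_information(a, b, bins=32):
    # Histogram-based estimate of the mutual information (in nats) between
    # the intensities of two images of equal size.
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()                  # joint intensity distribution
    px = pxy.sum(axis=1, keepdims=True)        # marginal of a
    py = pxy.sum(axis=0, keepdims=True)        # marginal of b
    nz = pxy > 0                               # skip empty cells (0 log 0 = 0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))
```

Because the measure depends only on how predictable one intensity is from the other, two aligned images related by an arbitrary (even inverted) intensity mapping score highly, whereas a brightness difference metric would report a large error.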

Kybic and Unser (2003) provide a nice literature review and describe a complete working
system based on representing both the images and the deformation fields as multi-resolution
splines. Figure 8.10 shows an example of the Kybic and Unser system being used to register
a patient’s brain MRI with a labeled brain atlas image. The system can be run in a fully
automatic mode but more accurate results can be obtained by locating a few key landmarks. More
recent papers on deformable medical image registration, including performance evaluations,
include (Klein, Staring, and Pluim 2007; Glocker, Komodakis, Tziritas et al. 2008).

¹⁰ In computer graphics, such elastic volumetric deformations are known as free-form deformations (Sederberg and
Parry 1986; Coquillart 1990; Celniker and Gossard 1991).

Figure 8.11 Octree spline-based image registration of two vertebral surface models (Szeliski
and Lavallee 1996) © 1996 Springer: (a) after initial rigid alignment; (b) after elastic
alignment; (c) a cross-section through the adapted octree spline deformation field.

As with other applications, regular volumetric splines can be enhanced using selective 
refinement. In the case of 3D volumetric image or surface registration, these are known as 
octree splines (Szeliski and Lavallee 1996) and have been used to register medical surface 
models such as vertebrae and faces from different patients (Figure 8.11). 

8.4 Optical flow 

The most general (and challenging) version of motion estimation is to compute an indepen- 
dent estimate of motion at each pixel, which is generally known as optical (or optic) flow. As 
we mentioned in the previous section, this generally involves minimizing the brightness or 
color difference between corresponding pixels summed over the image, 

$$E_{\text{SSD-OF}}(\{u_i\}) = \sum_i [I_1(x_i + u_i) - I_0(x_i)]^2. \tag{8.69}$$

Since the number of variables $\{u_i\}$ is twice the number of measurements, the problem is
underconstrained. The two classic approaches to this problem are to perform the summation
locally over overlapping regions (the patch-based or window-based approach) or to
add smoothness terms on the $\{u_i\}$ field using regularization or Markov random fields
(Section 3.7) and to search for a global minimum.

The patch-based approach usually involves using a Taylor series expansion of the dis- 
placed image function (8.35) in order to obtain sub-pixel estimates (Lucas and Kanade 1981). 
Anandan (1989) shows how a series of local discrete search steps can be interleaved with
Lucas-Kanade incremental refinement steps in a coarse-to-fine pyramid scheme, which
allows the estimation of large motions, as described in Section 8.1.1. He also analyzes how the
uncertainty in local motion estimates is related to the eigenvalues of the local Hessian matrix
$A_i$ (8.44), as shown in Figures 8.3-8.4.

Bergen, Anandan, Hanna et al. (1992) develop a unified framework for describing both
parametric (Section 8.2) and patch-based optic flow algorithms and provide a nice
introduction to this topic. After each iteration of optic flow estimation in a coarse-to-fine pyramid,
they re-warp one of the images so that only incremental flow estimates are computed
(Section 8.1.1). When overlapping patches are used, an efficient implementation is to first
compute the outer products of the gradients and intensity errors (8.40-8.41) at every pixel and
then perform the overlapping window sums using a moving average filter.¹¹
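
The moving-average trick can be sketched as follows: per-pixel products are formed once, a box filter (implemented here with an integral image) produces all overlapping window sums, and a 2 × 2 system is solved at every pixel. This is a single linearization step at zero flow, without a pyramid or re-warping, and the helper names and the determinant threshold are assumptions of this example.

```python
import numpy as np

def box_sum(img, r):
    # Sum over every (2r+1) x (2r+1) window using an integral image;
    # the borders are handled by edge replication.
    p = np.pad(img, r, mode="edge")
    c = np.pad(p, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    k = 2 * r + 1
    return c[k:, k:] - c[:-k, k:] - c[k:, :-k] + c[:-k, :-k]

def lucas_kanade(I0, I1, r=3):
    # One patch-based flow update at every pixel, starting from zero flow.
    gy, gx = np.gradient(I1)                 # axis 0 is y, axis 1 is x
    e = I1 - I0                              # per-pixel intensity error
    Axx, Axy, Ayy = box_sum(gx * gx, r), box_sum(gx * gy, r), box_sum(gy * gy, r)
    bx, by = -box_sum(gx * e, r), -box_sum(gy * e, r)
    det = Axx * Ayy - Axy ** 2
    det = np.where(np.abs(det) < 1e-9, np.inf, det)   # textureless: zero flow
    u = (Ayy * bx - Axy * by) / det          # closed-form 2x2 solve of A u = b
    v = (Axx * by - Axy * bx) / det
    return u, v
```

The cost per pixel is independent of the window size, which is the point of aggregating with a moving average rather than re-summing each overlapping patch.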

Instead of solving for each motion (or motion update) independently, Horn and Schunck
(1981) develop a regularization-based framework where (8.69) is simultaneously minimized
over all flow vectors $\{u_i\}$. In order to constrain the problem, smoothness constraints, i.e.,
squared penalties on flow derivatives, are added to the basic per-pixel error metric. Because
the technique was originally developed for small motions in a variational (continuous
function) framework, the linearized brightness constancy constraint corresponding to (8.35), i.e.,
(8.38), is more commonly written as an analytic integral

$$E_{\text{HS}} = \int (I_x u + I_y v + I_t)^2 \, dx \, dy, \tag{8.70}$$

where $(I_x, I_y) = \nabla I_1 = J_1$ and $I_t = e_i$ is the temporal derivative, i.e., the brightness
change between images. The Horn and Schunck model can also be viewed as the limiting
case of spline-based motion estimation as the splines become 1 × 1 pixel patches.
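
A minimal sketch of the resulting iteration is given below. It uses the classic Jacobi-style update, in which each flow vector is pulled toward the mean of its neighbors and then corrected along the image gradient. The smoothness weight `alpha`, the 4-neighbor mean (the original paper uses a weighted 8-neighbor average), and the gradient taken from the average of the two images are assumptions made for brevity.

```python
import numpy as np

def neighbor_mean(f):
    # Mean of the four nearest neighbors, with edge replication at the borders.
    p = np.pad(f, 1, mode="edge")
    return 0.25 * (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:])

def horn_schunck(I0, I1, alpha=1.0, iters=200):
    # Regularized flow: brightness constancy plus a squared smoothness penalty.
    Iy, Ix = np.gradient(0.5 * (I0 + I1))    # spatial derivatives
    It = I1 - I0                             # temporal derivative
    u, v = np.zeros_like(I0), np.zeros_like(I0)
    for _ in range(iters):
        ub, vb = neighbor_mean(u), neighbor_mean(v)
        # Residual of the linearized constraint at the smoothed flow,
        # normalized by the data and smoothness weights.
        t = (Ix * ub + Iy * vb + It) / (alpha ** 2 + Ix ** 2 + Iy ** 2)
        u, v = ub - Ix * t, vb - Iy * t
    return u, v
```

On an image with a single gradient direction, this recovers only the flow component along the gradient, which is the aperture problem in its simplest form. In textured images the smoothness term propagates information between pixels, although many iterations may be needed.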

It is also possible to combine ideas from local and global flow estimation into a single
framework by using a locally aggregated (as opposed to single-pixel) Hessian as the
brightness constancy term (Bruhn, Weickert, and Schnorr 2005). Consider the discrete analog
(8.35) to the analytic global energy (8.70),

$$E_{\text{HSD}} = \sum_i u_i^T [J_i J_i^T] u_i + 2 e_i J_i^T u_i + e_i^2. \tag{8.71}$$

If we replace the per-pixel (rank 1) Hessians $A_i = [J_i J_i^T]$ and residuals $b_i = J_i e_i$ with
area-aggregated versions (8.40-8.41), we obtain a global minimization algorithm where
region-based brightness constraints are used.

Another extension to the basic optic flow model is to use a combination of global
(parametric) and local motion models. For example, if we know that the motion is due to a camera
moving in a static scene (rigid motion), we can re-formulate the problem as the estimation of
a per-pixel depth along with the parameters of the global camera motion (Adiv 1989; Hanna
1991; Bergen, Anandan, Hanna et al. 1992; Szeliski and Coughlan 1997; Nir, Bruckstein,
and Kimmel 2008; Wedel, Cremers, Pock et al. 2009). Such techniques are closely related to
stereo matching (Chapter 11). Alternatively, we can estimate either per-image or per-segment
affine motion models combined with per-pixel residual corrections (Black and Jepson 1996;
Ju, Black, and Jepson 1996; Chang, Tekalp, and Sezan 1997; Memin and Perez 2002). We
revisit this topic in Section 8.5.

¹¹ Other smoothing or aggregation filters can also be used at this stage (Bruhn, Weickert, and Schnorr 2005).

Of course, image brightness may not always be an appropriate metric for measuring ap- 
pearance consistency, e.g., when the lighting in an image is varying. As discussed in Sec- 
tion 8.1, matching gradients, filtered images, or other metrics such as image Hessians (sec- 
ond derivative measures) may be more appropriate. It is also possible to locally compute the 
phase of steerable filters in the image, which is insensitive to both bias and gain transforma- 
tions (Fleet and Jepson 1990). Papenberg, Bruhn, Brox et al. (2006) review and explore such 
constraints and also provide a detailed analysis and justification for iteratively re-warping
images during incremental flow computation. 

Because the brightness constancy constraint is evaluated at each pixel independently, 
rather than being summed over patches where the constant flow assumption may be violated, 
global optimization approaches tend to perform better near motion discontinuities. This is 
especially true if robust metrics are used in the smoothness constraint (Black and Anandan 
1996; Bab-Hadiashar and Suter 1998a).¹² One popular choice for robust metrics is the $L_1$
norm, also known as total variation (TV), which results in a convex energy whose global
minimum can be found (Bruhn, Weickert, and Schnorr 2005; Papenberg, Bruhn, Brox et 
al. 2006). Anisotropic smoothness priors, which apply a different smoothness in the direc- 
tions parallel and perpendicular to the image gradient, are another popular choice (Nagel and 
Enkelmann 1986; Sun, Roth, Lewis et al. 2008; Werlberger, Trobin, Pock et al. 2009). It 
is also possible to learn a set of better smoothness constraints (derivative filters and robust 
functions) from a set of paired flow and intensity images (Sun, Roth, Lewis et al. 2008). Ad- 
ditional details on some of these techniques are given by Baker, Black, Lewis et al. (2007) 
and Baker, Scharstein, Lewis et al. (2009). 

Because of the large, two-dimensional search space in estimating flow, most algorithms 
use variations of gradient descent and coarse-to-fine continuation methods to minimize the 
global energy function. This contrasts starkly with stereo matching (which is an “easier” 
one-dimensional disparity estimation problem), where combinatorial optimization techniques 
have been the method of choice for the last decade. 

Fortunately, combinatorial optimization methods based on Markov random fields are
beginning to appear and tend to be among the better-performing methods on the recently
released optical flow database (Baker, Black, Lewis et al. 2007).¹³

¹² Robust brightness metrics (Section 8.1, (8.2)) can also help improve the performance of window-based
approaches (Black and Anandan 1996).

Figure 8.12 Evaluation of the results of 24 optical flow algorithms, October 2009,
http://vision.middlebury.edu/flow/, (Baker, Scharstein, Lewis et al. 2009). By moving the mouse
pointer over an underlined performance score, the user can interactively view the
corresponding flow and error maps. Clicking on a score toggles between the computed and ground truth
flows. Next to each score, the corresponding rank in the current column is indicated by a
smaller blue number. The minimum (best) score in each column is shown in boldface. The
table is sorted by the average rank (computed over all 24 columns, three region masks for each
of the eight sequences). The average rank serves as an approximate measure of performance
under the selected metric/statistic.

Examples of such techniques include the one developed by Glocker, Paragios, Komodakis
et al. (2008), who use a coarse-to-fine strategy with per-pixel 2D uncertainty estimates, which
are then used to guide the refinement and search at the next finer level. Instead of using
gradient descent to refine the flow estimates, a combinatorial search over discrete displacement
labels (which is able to find better energy minima) is performed using their Fast-PD algorithm
(Komodakis, Tziritas, and Paragios 2008).

Lempitsky, Roth, and Rother (2008) use fusion moves (Lempitsky, Rother, and Blake
2007) over proposals generated from basic flow algorithms (Horn and Schunck 1981; Lucas 
and Kanade 1981) to find good solutions. The basic idea behind fusion moves is to replace 
portions of the current best estimate with hypotheses generated by more basic techniques 
(or their shifted versions) and to alternate them with local gradient descent for better energy 
minimization. 

The field of accurate motion estimation continues to evolve at a rapid pace, with signif- 
icant advances in performance occurring every year. The optical flow evaluation Web site 
(http://vision.middlebury.edu/flow/) is a good source of pointers to high-performing recently 
developed algorithms (Figure 8.12). 

8.4.1 Multi-frame motion estimation 

So far, we have looked at motion estimation as a two-frame problem, where the goal is to 
compute a motion field that aligns pixels from one image with those in another. In practice, 
motion estimation is usually applied to video, where a whole sequence of frames is available 
to perform this task. 

One classic approach to multi-frame motion is to filter the spatio-temporal volume using 
oriented or steerable filters (Heeger 1988), in a manner analogous to oriented edge detec- 
tion (Section 3.2.3). Figure 8.13 shows two frames from the commonly used flower garden 
sequence, as well as a horizontal slice through the spatio-temporal volume, i.e., the 3D vol- 
ume created by stacking all of the video frames together. Because the pixel motion is mostly 
horizontal, the slopes of individual (textured) pixel tracks, which correspond to their horizon- 
tal velocities, can clearly be seen. Spatio-temporal filtering uses a 3D volume around each 
pixel to determine the best orientation in space-time, which corresponds directly to a pixel’s 
velocity. 

Unfortunately, in order to obtain reasonably accurate velocity estimates everywhere in
an image, spatio-temporal filters have moderately large extents, which severely degrades the
quality of their estimates near motion discontinuities. (This same problem is endemic in
2D window-based motion estimators.) An alternative to full spatio-temporal filtering is to
estimate more local spatio-temporal derivatives and use them inside a global optimization
framework to fill in textureless regions (Bruhn, Weickert, and Schnorr 2005; Govindu 2006).

¹³ http://vision.middlebury.edu/flow/.

Figure 8.13 Slice through a spatio-temporal volume (Szeliski 1999) © 1999 IEEE: (a-b)
two frames from the flower garden sequence; (c) a horizontal slice through the complete
spatio-temporal volume, with the arrows indicating locations of potential key frames where
flow is estimated. Note that the colors for the flower garden sequence are incorrect; the correct
colors (yellow flowers) are shown in Figure 8.15.

Another alternative is to simultaneously estimate multiple motion estimates, while also 
optionally reasoning about occlusion relationships (Szeliski 1999). Figure 8.13c shows schemat- 
ically one potential approach to this problem. The horizontal arrows show the locations of 
keyframes s where motion is estimated, while other slices indicate video frames t whose 
colors are matched with those predicted by interpolating between the keyframes. Motion es- 
timation can be cast as a global energy minimization problem that simultaneously minimizes 
brightness compatibility and flow compatibility terms between keyframes and other frames, 
in addition to using robust smoothness terms. 

The multi-view framework is potentially even more appropriate for rigid scene motion 
(multi-view stereo) (Section 11.6), where the unknowns at each pixel are disparities and 
occlusion relationships can be determined directly from pixel depths (Szeliski 1999; Kol- 
mogorov and Zabih 2002). However, it may also be applicable to general motion, with the 
addition of models for object accelerations and occlusion relationships. 

8.4.2 Application: Video denoising

Video denoising is the process of removing noise and other artifacts such as scratches from 
film and video (Kokaram 2004). Unlike single image denoising, where the only information 
available is in the current picture, video denoisers can average or borrow information from 
adjacent frames. However, in order to do this without introducing blur or jitter (irregular 
motion), they need accurate per-pixel motion estimates. 

Exercise 8.7 lists some of the steps required, which include the ability to determine if the
current motion estimate is accurate enough to permit averaging with other frames. Gai and
Kang (2009) describe their recently developed restoration process, which involves a series of
additional steps to deal with the special characteristics of vintage film.

8.4.3 Application: De-interlacing

Another commonly used application of per-pixel motion estimation is video de-interlacing, 
which is the process of converting a video taken with alternating fields of even and odd 
lines to a non-interlaced signal that contains both fields in each frame (de Haan and Bellers 
1998). Two simple de-interlacing techniques are bob, which copies the line above or below 
the missing line from the same field, and weave, which copies the corresponding line from 
the field before or after. The names come from the visual artifacts generated by these two 
simple techniques: bob introduces an up-and-down bobbing motion along strong horizontal 
lines; weave can lead to a “zippering” effect along horizontally translating edges. Replacing 
these copy operators with averages can help but does not completely remove these artifacts. 
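
The two copy operators can be sketched in a few lines. The representation below is an
assumption made for illustration: each field is stored in a full-height array in which only
the rows of the given parity (0 for even lines, 1 for odd lines) are valid, and the image is
assumed to have at least two rows.

```python
import numpy as np

def bob(field_frame, parity):
    """Fill missing lines by copying a neighboring line from the same field.

    field_frame: (H, W) array in which only rows with index % 2 == parity are valid.
    """
    out = field_frame.copy()
    for y in range(out.shape[0]):
        if y % 2 != parity:
            src = y - 1 if y - 1 >= 0 else y + 1   # line above, else line below
            out[y] = field_frame[src]
    return out

def weave(field_frame, other_field_frame, parity):
    """Fill missing lines with the corresponding lines of an adjacent field."""
    out = field_frame.copy()
    out[1 - parity::2] = other_field_frame[1 - parity::2]
    return out
```

Bob uses only spatial neighbors and therefore halves the vertical resolution, whereas weave
keeps full resolution but mixes two time instants, which is what produces the zippering on
moving edges.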

A wide variety of improved techniques have been developed for this process, which is 
often embedded in specialized DSP chips found inside video digitization boards in computers 
(since broadcast video is often interlaced, while computer monitors are not). A large class 
of these techniques estimates local per-pixel motions and interpolates the missing data from 
the information available in spatially and temporally adjacent fields. Dai, Baker, and Kang 
(2009) review this literature and propose their own algorithm, which selects among seven 
different interpolation functions at each pixel using an MRF framework. 

8.5 Layered motion 

In many situations, visual motion is caused by the movement of a small number of objects
at different depths in the scene. In such situations, the pixel motions can be described more 
succinctly (and estimated more reliably) if pixels are grouped into appropriate objects or 
layers (Wang and Adelson 1994). 

Figure 8.14 shows this approach schematically. The motion in this sequence is caused by 
the translational motion of the checkered background and the rotation of the foreground hand. 
The complete motion sequence can be reconstructed from the appearance of the foreground 
and background elements, which can be represented as alpha-matted images (sprites or video
objects) and the parametric motion corresponding to each layer. Displacing and compositing
these layers in back to front order (Section 3.1.3) recreates the original video sequence. 
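
A minimal sketch of this synthesis step is given below, under simplifying assumptions that
are not part of the original formulation: each layer moves by an integer translation per
frame, the translation is applied as a cyclic shift, and colors are not premultiplied by
alpha. The function name composite_layers is hypothetical.

```python
import numpy as np

def composite_layers(layers, t):
    """Composite translating sprites in back-to-front order with the over operator.

    layers: list of (color, alpha, velocity) tuples, ordered back to front.
      color: (H, W) float image; alpha: (H, W) opacity in [0, 1];
      velocity: (dy, dx) integer displacement per frame.
    t: integer frame index.
    Translation is a cyclic shift (np.roll), chosen purely for brevity.
    """
    h, w = layers[0][0].shape
    out = np.zeros((h, w))
    for color, alpha, (dy, dx) in layers:
        shift = (dy * t, dx * t)
        c = np.roll(color, shift, axis=(0, 1))
        a = np.roll(alpha, shift, axis=(0, 1))
        out = a * c + (1.0 - a) * out     # over: nearer layer hides what is behind
    return out
```

Calling the function for t = 0, 1, 2, ... regenerates the frames of the sequence from the
sprites and their motions alone.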

Layered motion representations not only lead to compact representations (Wang and
Adelson 1994; Lee, ge Chen, lung Bruce Lin et al. 1997), but they also exploit the
information available in multiple video frames, as well as accurately modeling the appearance
of pixels near motion discontinuities. This makes them particularly suited as a representation
for image-based rendering (Section 13.2.1) (Shade, Gortler, He et al. 1998; Zitnick, Kang,
Uyttendaele et al. 2004) as well as object-level video editing.

Figure 8.14 Layered motion estimation framework (Wang and Adelson 1994) © 1994
IEEE: The top two rows describe the two layers, each of which consists of an intensity (color)
image, an alpha mask (black=transparent), and a parametric motion field. The layers are
composited with different amounts of motion to recreate the video sequence. (Panel labels:
intensity map, alpha map, velocity map; frames 1, 2, and 3.)

To compute a layered representation of a video sequence, Wang and Adelson (1994) first 
estimate affine motion models over a collection of non-overlapping patches and then cluster 
these estimates using k-means. They then alternate between assigning pixels to layers and 
recomputing motion estimates for each layer using the assigned pixels, using a technique 
first proposed by Darrell and Pentland (1991). Once the parametric motions and pixel-wise 
layer assignments have been computed for each frame independently, layers are constructed 
by warping and merging the various layer pieces from all of the frames together. Median 
filtering is used to produce sharp composite layers that are robust to small intensity variations, 
as well as to infer occlusion relationships between the layers. Figure 8.15 shows the results 
of this process on the flower garden sequence. You can see both the initial and final layer 
assignments for one of the frames, as well as the composite flow and the alpha-matted layers 
with their corresponding flow vectors overlaid. 
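
The alternation between fitting motion models and assigning pixels can be sketched as
follows. This is only an illustration under stated assumptions: a dense flow field (us, vs)
sampled at pixel positions (xs, ys) is taken as given, the k-means clustering of the initial
patch estimates is omitted, and the function names are hypothetical.

```python
import numpy as np

def fit_affine(xs, ys, us, vs):
    """Least-squares affine motion: u = a0 + a1*x + a2*y, v = a3 + a4*x + a5*y."""
    A = np.stack([np.ones_like(xs), xs, ys], axis=1)
    pu, *_ = np.linalg.lstsq(A, us, rcond=None)
    pv, *_ = np.linalg.lstsq(A, vs, rcond=None)
    return np.concatenate([pu, pv])

def predict(params, xs, ys):
    """Evaluate an affine motion model at the given pixel positions."""
    u = params[0] + params[1] * xs + params[2] * ys
    v = params[3] + params[4] * xs + params[5] * ys
    return u, v

def assign_layers(models, xs, ys, us, vs):
    """Assign each pixel to the motion model that best predicts its flow."""
    errs = []
    for p in models:
        pu, pv = predict(p, xs, ys)
        errs.append((pu - us) ** 2 + (pv - vs) ** 2)
    return np.argmin(np.stack(errs), axis=0)
```

One iteration would call assign_layers with the current models and then call fit_affine once
per layer, using only the pixels currently assigned to that layer.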

In follow-on work, Weiss and Adelson (1996) use a formal probabilistic mixture model
to infer both the optimal number of layers and the per-pixel layer assignments. Weiss (1997)
further generalizes this approach by replacing the per-layer affine motion models with smooth
regularized per-pixel motion estimates, which allows the system to better handle curved and
undulating layers, such as those seen in most real-world sequences.

Figure 8.15 Layered motion estimation results (Wang and Adelson 1994) © 1994 IEEE.
(Panel label: layers with pixel assignments and flow.)

The above approaches, however, still make a distinction between estimating the motions 
and layer assignments and then later estimating the layer colors. In the system described by 
Baker, Szeliski, and Anandan (1998), the generative model illustrated in Figure 8.14 is gen- 
eralized to account for real-world rigid motion scenes. The motion of each frame is described 
using a 3D camera model and the motion of each layer is described using a 3D plane equation 
plus per-pixel residual depth offsets (the plane plus parallax representation (Section 2.1.5)). 
The initial layer estimation proceeds in a manner similar to that of Wang and Adelson (1994), 
except that rigid planar motions (homographies) are used instead of affine motion models. 
The final model refinement, however, jointly re-optimizes the layer pixel color and opacity
values $L_l$ and the 3D depth, plane, and motion parameters $z_l$, $n_l$, and $P_t$ by minimizing the
discrepancy between the re-synthesized and observed motion sequences (Baker, Szeliski, and 
Anandan 1998). 

Figure 8.16 shows the final results obtained with this algorithm. As you can see, the 
motion boundaries and layer assignments are much crisper than those in Figure 8.15. Because 
of the per-pixel depth offsets, the individual layer color values are also sharper than those 
obtained with affine or planar motion models. While the original system of Baker, Szeliski, 
and Anandan (1998) required a rough initial assignment of pixels to layers, Torr, Szeliski, 
and Anandan (2001) describe automated Bayesian techniques for initializing this system and 
determining the optimal number of layers. 

Layered motion estimation continues to be an active area of research. Representative pa- 
pers in this area include (Sawhney and Ayer 1996; Jojic and Frey 2001; Xiao and Shah 2005; 
Kumar, Torr, and Zisserman 2008; Thayananthan, Iwasaki, and Cipolla 2008; Schoenemann 
and Cremers 2008). 
Figure 8.16 Layered stereo reconstruction (Baker, Szeliski, and Anandan 1998) © 1998
IEEE: (a) first and (b) last input images; (c) initial segmentation into six layers; (d) and
(e) the six layer sprites; (f) depth map for planar sprites (darker denotes closer); front layer
(g) before and (h) after residual depth estimation. Note that the colors for the flower garden
sequence are incorrect; the correct colors (yellow flowers) are shown in Figure 8.15.


Of course, layers are not the only way to introduce segmentation into motion estimation. 
A large number of algorithms have been developed that alternate between estimating optic 
flow vectors and segmenting them into coherent regions (Black and Jepson 1996; Ju, Black, 
and Jepson 1996; Chang, Tekalp, and Sezan 1997; Memin and Perez 2002; Cremers and 
Soatto 2005). Some of the more recent techniques rely on first segmenting the input color 
images and then estimating per-segment motions that produce a coherent motion field while 
also modeling occlusions (Zitnick, Kang, Uyttendaele et al. 2004; Zitnick, Jojic, and Kang 
2005; Stein, Hoiem, and Hebert 2007; Thayananthan, Iwasaki, and Cipolla 2008). 

8.5.1 Application: Frame interpolation

Frame interpolation is another widely used application of motion estimation, often imple- 
mented in the same circuitry as de-interlacing hardware required to match an incoming video 
to a monitor’s actual refresh rate. As with de-interlacing, information from novel in-between
frames needs to be interpolated from preceding and subsequent frames. The best results can 
be obtained if an accurate motion estimate can be computed at each unknown pixel’s lo- 
cation. However, in addition to computing the motion, occlusion information is critical to 
prevent colors from being contaminated by moving foreground objects that might obscure a 
particular pixel in a preceding or subsequent frame. 

In a little more detail, consider Figure 8.13c and assume that the arrows denote keyframes 
between which we wish to interpolate additional images. The orientations of the streaks 
in this figure encode the velocities of individual pixels. If the same motion estimate $\mathbf{u}_0$ is
obtained at location $\mathbf{x}_0$ in image $I_0$ as is obtained at location $\mathbf{x}_0 + \mathbf{u}_0$ in image $I_1$, the flow
vectors are said to be consistent. This motion estimate can be transferred to location $\mathbf{x}_0 + t\mathbf{u}_0$
in the image $I_t$ being generated, where $t \in (0, 1)$ is the time of interpolation. The final color
value at pixel $\mathbf{x}_0 + t\mathbf{u}_0$ can be computed as a linear blend,

$I_t(\mathbf{x}_0 + t\mathbf{u}_0) = (1 - t) I_0(\mathbf{x}_0) + t I_1(\mathbf{x}_0 + \mathbf{u}_0)$. (8.72)

If, however, the motion vectors are different at corresponding locations, some method must 
be used to determine which is correct and which image contains colors that are occluded. 
The actual reasoning is even more subtle than this. One example of such an interpolation
algorithm, based on earlier work in depth map interpolation (Shade, Gortler, He et al. 1998;
Zitnick, Kang, Uyttendaele et al. 2004), is the one used in the flow evaluation papers of
Baker, Black, Lewis et al. (2007) and Baker, Scharstein, Lewis et al. (2009). An even
higher-quality frame interpolation algorithm, which uses gradient-based reconstruction, is
presented by Mahajan, Huang, Matusik et al. (2009).
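
The consistency test and the linear blend (8.72) can be sketched as below. This is a
deliberately simplified illustration and not the algorithm used in the cited evaluation
papers: target positions are rounded to the nearest pixel, inconsistent pixels are simply
skipped rather than resolved by occlusion reasoning, and holes are left as NaN. The function
name and the tolerance tol are assumptions.

```python
import numpy as np

def interpolate_frame(I0, I1, u0, u1, t, tol=0.5):
    """Blend I0 and I1 at time t in (0, 1) using forward-transferred flow.

    u0: (H, W, 2) flow (dx, dy) from I0 to I1, defined on I0's pixel grid.
    u1: (H, W, 2) flow estimate on I1's pixel grid, same sign convention.
    Pixels whose two estimates differ by more than tol are skipped.
    """
    h, w = I0.shape
    out = np.full((h, w), np.nan)
    for y in range(h):
        for x in range(w):
            dx, dy = u0[y, x]
            x1, y1 = int(round(x + dx)), int(round(y + dy))
            if not (0 <= x1 < w and 0 <= y1 < h):
                continue                          # flow leaves the image
            if np.linalg.norm(u1[y1, x1] - u0[y, x]) > tol:
                continue                          # inconsistent: possible occlusion
            xt, yt = int(round(x + t * dx)), int(round(y + t * dy))
            out[yt, xt] = (1 - t) * I0[y, x] + t * I1[y1, x1]
    return out
```

A practical implementation would splat with sub-pixel weights, resolve collisions by depth
or motion ordering, and fill the remaining holes from whichever source image is unoccluded.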

8.5.2 Transparent layers and reflections 

A special case of layered motion that occurs quite often is transparent motion, which is usu- 
ally caused by reflections seen in windows and picture frames (Figures 8.17 and 8.18). 

Some of the early work in this area handles transparent motion by either just estimating 
the component motions (Shizawa and Mase 1991; Bergen, Burt, Hingorani et al. 1992; Darrell
and Simoncelli 1993; Irani, Rousso, and Peleg 1994) or by assigning individual pixels to 
competing motion layers (Darrell and Pentland 1995; Black and Anandan 1996; Ju, Black, 
and Jepson 1996), which is appropriate for scenes partially seen through a fine occluder 
(e.g., foliage). However, to accurately separate truly transparent layers, a better model for 
motion due to reflections is required. Because of the way that light is both reflected from 
and transmitted through a glass surface, the correct model for reflections is an additive one, 
where each moving layer contributes some intensity to the final image (Szeliski, Avidan, and 
Anandan 2000). 
Figure 8.17 Light reflecting off the transparent glass of a picture frame: (a) first image from
the input sequence; (b) dominant motion layer min-composite; (c) secondary motion residual
layer max-composite; (d-e) final estimated picture and reflection layers. The original images
are from Black and Anandan (1996), while the separated layers are from Szeliski, Avidan,
and Anandan (2000) © 2000 IEEE.




If the motions of the individual layers are known, the recovery of the individual layers is 
a simple constrained least squares problem, with the individual layer images constrained
to be positive. However, this problem can suffer from extended low-frequency ambiguities, 
especially if either of the layers lacks dark (black) pixels or the motion is uni-directional. In 
their paper, Szeliski, Avidan, and Anandan (2000) show that the simultaneous estimation of 
the motions and layer values can be obtained by alternating between robustly computing the 
motion layers and then making conservative (upper- or lower-bound) estimates of the layer 
intensities. The final motion and layer estimates can then be polished using gradient descent 
on a joint constrained least squares formulation similar to (Baker, Szeliski, and Anandan 
1998), where the over compositing operator is replaced with addition. 
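
The conservative bounds can be illustrated with a small sketch. It assumes the additive
model, two layers, and known integer horizontal motions applied as cyclic shifts; the robust
motion estimation and the final least squares refinement are not shown, and the function
name is hypothetical. Because both layers are non-negative, every stabilized frame is at
least as bright as the first layer, so the minimum over time is an upper bound on it;
subtracting that bound leaves a residual that never exceeds the second layer, so its maximum
over time is a lower bound.

```python
import numpy as np

def bound_layers(frames, shifts1, shifts2):
    """Conservative layer estimates under the additive model I_t = L1 + L2.

    frames: list of (H, W) images; shifts1/shifts2: known integer x-shifts of
    layer 1 and layer 2 in each frame (cyclic shifts assumed for brevity).
    Returns (upper bound on layer 1, lower bound on layer 2).
    """
    # Stabilize with respect to layer 1 and take the minimum over time.
    stab1 = [np.roll(f, -s, axis=1) for f, s in zip(frames, shifts1)]
    l1_upper = np.min(stab1, axis=0)
    # Subtract the re-warped layer-1 bound, stabilize w.r.t. layer 2, take the max.
    resid = [np.roll(f - np.roll(l1_upper, s1, axis=1), -s2, axis=1)
             for f, s1, s2 in zip(frames, shifts1, shifts2)]
    l2_lower = np.clip(np.max(resid, axis=0), 0.0, None)
    return l1_upper, l2_lower
```

The bounds become tight wherever the other layer is dark in at least one frame, which is
consistent with the ambiguity noted above when a layer lacks dark pixels.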

Figures 8.17 and 8.18 show the results of applying these techniques to two different pic- 
ture frames with reflections. Notice how, in the second sequence, the amount of reflected light 
is quite low compared to the transmitted light (the picture of the girl) and yet the algorithm is 
still able to recover both layers. 

Unfortunately, the simple parametric motion models used in (Szeliski, Avidan, and Anandan
2000) are only valid for planar reflectors and scenes with shallow depth. The extension of
these techniques to curved reflectors and scenes with significant depth has also been studied
(Swaminathan, Kang, Szeliski et al. 2002; Criminisi, Kang, Swaminathan et al. 2005), as has
the extension to scenes with more complex 3D depth (Tsin, Kang, and Szeliski 2006).

Figure 8.18 Transparent motion separation (Szeliski, Avidan, and Anandan 2000) © 2000
IEEE: (a) first image from input sequence; (b) dominant motion layer min-composite; (c)
secondary motion residual layer max-composite; (d-e) final estimated picture and reflection
layers. Note that the reflected layers in (c) and (e) are doubled in intensity to better show
their structure.


8.6 Additional reading 

Some of the earliest algorithms for motion estimation were developed for motion-compen- 
sated video coding (Netravali and Robbins 1979) and such techniques continue to be used 
in modern coding standards such as MPEG, H.263, and H.264 (Le Gall 1991; Richardson 
2003). 14 In computer vision, this field was originally called image sequence analysis (Huang 
1981). Some of the early seminal papers include the variational approaches developed by 
Horn and Schunck (1981) and Nagel and Enkelmann (1986), and the patch-based translational 
alignment technique developed by Lucas and Kanade (1981). Hierarchical (coarse-to-fine) 
versions of such algorithms were developed by Quam (1984), Anandan (1989), and Bergen, 
Anandan, Hanna et al. (1992), although they have also long been used in motion estimation 
for video coding. 

Translational motion models were generalized to affine motion by Rehg and Witkin (1991),
Fuh and Maragos (1991), and Bergen, Anandan, Hanna et al. (1992) and to quadric reference
surfaces by Shashua and Toelg (1997) and Shashua and Wexler (2001); see Baker and
Matthews (2004) for a nice review. Such parametric motion estimation algorithms have found
widespread application in video summarization (Teodosio and Bender 1993; Irani and Anandan
1998), video stabilization (Hansen, Anandan, Dana et al. 1994; Srinivasan, Chellappa,
Veeraraghavan et al. 2005; Matsushita, Ofek, Ge et al. 2006), and video compression (Irani,
Hsu, and Anandan 1995; Lee, ge Chen, lung Bruce Lin et al. 1997). Surveys of parametric
image registration include those by Brown (1992), Zitová and Flusser (2003), Goshtasby
(2005), and Szeliski (2006a).

14 http://www.itu.int/rec/T-REC-H.264.

Good general surveys and comparisons of optic flow algorithms include those by Ag- 
garwal and Nandhakumar (1988), Barron, Fleet, and Beauchemin (1994), Otte and Nagel 
(1994), Mitiche and Bouthemy (1996), Stiller and Konrad (1999), McCane, Novins, Cran- 
nitch et al. (2001), Szeliski (2006a), and Baker, Black, Lewis et al. (2007). The topic of 
matching primitives, i.e., pre-transforming images using filtering or other techniques before 
matching, is treated in a number of papers (Anandan 1989; Bergen, Anandan, Hanna et al. 
1992; Scharstein 1994; Zabih and Woodfill 1994; Cox, Roy, and Hingorani 1995; Viola and 
Wells III 1997; Negahdaripour 1998; Kim, Kolmogorov, and Zabih 2003; Jia and Tang 2003; 
Papenberg, Bruhn, Brox et al. 2006; Seitz and Baker 2009). Hirschmüller and Scharstein
(2009) compare a number of these approaches and report on their relative performance in 
scenes with exposure differences. 

The publication of a new benchmark for evaluating optical flow algorithms (Baker, Black, 
Lewis et al. 2007) has led to rapid advances in the quality of estimation algorithms, to the 
point where new datasets may soon become necessary. According to their updated techni- 
cal report (Baker, Scharstein, Lewis et al. 2009), most of the best performing algorithms use 
robust data and smoothness norms (often $L_1$ TV) and continuous variational optimization
techniques, although some techniques use discrete optimization or segmentations (Papenberg,
Bruhn, Brox et al. 2006; Trobin, Pock, Cremers et al. 2008; Xu, Chen, and Jia 2008;
Lempitsky, Roth, and Rother 2008; Werlberger, Trobin, Pock et al. 2009; Lei and Yang 2009;
Wedel, Cremers, Pock et al. 2009). 


8.7 Exercises 

Ex 8.1: Correlation Implement and compare the performance of the following correlation 
algorithms: 

• sum of squared differences (8.1) 

• sum of robust differences (8.2) 

• sum of absolute differences (8.3) 

• bias-gain compensated squared differences (8.9) 


• normalized cross-correlation (8.11)

• windowed versions of the above (8.22-8.23)

• Fourier-based implementations of the above measures (8.18-8.20) 

• phase correlation (8.24) 

• gradient cross-correlation (Argyriou and Vlachos 2003). 

Compare a few of your algorithms on different motion sequences with different amounts of 
noise, exposure variation, occlusion, and frequency variations (e.g., high-frequency textures, 
such as sand or cloth, and low-frequency images, such as clouds or motion-blurred video). 
Some datasets with illumination variation and ground truth correspondences (horizontal mo- 
tion) can be found at http://vision.middlebury.edu/stereo/data/ (the 2005 and 2006 datasets). 
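
As a possible starting point, the sketch below gives unweighted whole-image versions of three
of the measures together with a basic phase correlation. It is an illustration only: the
function names are hypothetical, the inputs are assumed to be equal-sized float arrays,
and the phase correlation assumes a cyclic (wrap-around) translation with no windowing.

```python
import numpy as np

def ssd(a, b):
    """Sum of squared differences."""
    return float(np.sum((a - b) ** 2))

def sad(a, b):
    """Sum of absolute differences."""
    return float(np.sum(np.abs(a - b)))

def ncc(a, b):
    """Normalized cross-correlation of the mean-subtracted images."""
    a0, b0 = a - a.mean(), b - b.mean()
    return float(np.sum(a0 * b0) / np.sqrt(np.sum(a0 ** 2) * np.sum(b0 ** 2)))

def phase_correlation(a, b):
    """Return the cyclic shift (dy, dx) such that b is close to np.roll(a, (dy, dx))."""
    A, B = np.fft.fft2(a), np.fft.fft2(b)
    cross = np.conj(A) * B
    cross /= np.maximum(np.abs(cross), 1e-12)   # whiten: keep only the phase
    peak = np.real(np.fft.ifft2(cross))
    return np.unravel_index(np.argmax(peak), peak.shape)
```

Shifts larger than half the image size wrap around and should be read as negative
displacements. To search over displacements with ssd, sad, or ncc, evaluate them on the
overlapping region of the two images for each candidate shift.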

Some additional ideas, variants, and questions: 

1. When do you think that phase correlation will outperform regular correlation or SSD? 
Can you show this experimentally or justify it analytically? 

2. For the Fourier-based masked or windowed correlation and sum of squared differences, 
the results should be the same as the direct implementations. Note that you will have 
to expand (8.5) into a sum of pairwise correlations, just as in (8.22). (This is part of the 
exercise.) 

3. For the bias-gain corrected variant of squared differences (8.9), you will also have 
to expand the terms to end up with a 3 x 3 (least squares) system of equations. If 
implementing the Fast Fourier Transform version, you will need to figure out how all 
of these entries can be evaluated in the Fourier domain. 

4. (Optional) Implement some of the additional techniques studied by Hirschmüller and
Scharstein (2009) and see if your results agree with theirs. 

Ex 8.2: Affine registration Implement a coarse-to-fine direct method for affine and
projective image alignment.

1. Does it help to use lower-order (simpler) models at coarser levels of the pyramid 
(Bergen, Anandan, Hanna et al. 1992)? 

2. (Optional) Implement patch-based acceleration (Shum and Szeliski 2000; Baker and 
Matthews 2004). 

3. See the Baker and Matthews (2004) survey for more comparisons and ideas. 

Ex 8.3: Stabilization Write a program to stabilize an input video sequence. You should 
implement the following steps, as described in Section 8.2.1: 
1. Compute the translation (and, optionally, rotation) between successive frames with
robust outlier rejection.

2. Perform temporal high-pass filtering on the motion parameters to remove the low- 
frequency component (smooth the motion). 

3. Compensate for the high-frequency motion, zooming in slightly (a user-specified amount) 
to avoid missing edge pixels. 

4. (Optional) Do not zoom in, but instead borrow pixels from previous or subsequent 
frames to fill in. 

5. (Optional) Compensate for images that are blurry because of fast motion by “stealing” 
higher frequencies from adjacent frames. 
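
Steps 2 and 3 can be sketched for the translation-only case as follows. The moving-average
filter, the window length, and the function name are assumptions chosen for illustration;
any low-pass filter could be substituted, and the subsequent warping and zooming of the
frames is not shown.

```python
import numpy as np

def stabilizing_offsets(frame_to_frame, window=5):
    """Compute per-frame corrections that remove high-frequency camera motion.

    frame_to_frame: (N, 2) array of estimated translations between successive frames.
    window: odd length of a moving-average low-pass filter (an arbitrary choice).
    """
    path = np.cumsum(frame_to_frame, axis=0)            # accumulated camera path
    pad = window // 2
    padded = np.pad(path, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    smooth = np.stack([np.convolve(padded[:, i], kernel, mode="valid")
                       for i in range(path.shape[1])], axis=1)
    jitter = path - smooth                              # high-pass component
    return -jitter                                      # shift each frame by this
```

For a camera that pans at constant speed, the interior offsets are zero, so intentional
motion is left alone and only the deviation from the smoothed path is compensated.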

Ex 8.4: Optical flow Compute optical flow (spline-based or per-pixel) between two im- 
ages, using one or more of the techniques described in this chapter. 

1. Test your algorithms on the motion sequences available at http://vision.middlebury. 
edu/flow/ or http://people.csail.mit.edu/celiu/motionAnnotation/ and compare your re- 
sults (visually) to those available on these Web sites. If you think your algorithm is 
competitive with the best, consider submitting it for formal evaluation. 

2. Visualize the quality of your results by generating in-between images using frame in- 
terpolation (Exercise 8.5). 

3. What can you say about the relative efficiency (speed) of your approach? 

Ex 8.5: Automated morphing / frame interpolation Write a program to automatically morph 
between pairs of images. Implement the following steps, as sketched out in Section 8.5.1 and 
by Baker, Scharstein, Lewis et al. (2009):

1. Compute the flow both ways (previous exercise). Consider using a multi-frame (n > 2)
technique to better deal with occluded regions. 

2. For each intermediate (morphed) image, compute a set of flow vectors and which im- 
ages should be used in the final composition. 

3. Blend (cross-dissolve) the images and view with a sequence viewer. 

Try this out on images of your friends and colleagues and see what kinds of morphs you get. 
Alternatively, take a video sequence and do a high-quality slow-motion effect. Compare your 
algorithm with simple cross-fading. 
Ex 8.6: Motion-based user interaction Write a program to compute a low-resolution
motion field in order to interactively control a simple application (Cutler and Turk 1998). For
example: 

1. Downsample each image using a pyramid and compute the optical flow (spline-based
or pixel-based) from the previous frame. 

2. Segment each training video sequence into different “actions” (e.g., hand moving in- 
wards, moving up, no motion) and “learn” the velocity fields associated with each one. 
(You can simply find the mean and variance for each motion field or use something 
more sophisticated, such as a support vector machine (SVM).) 

3. Write a recognizer that finds successive actions of approximately the right duration and 
hook it up to an interactive application (e.g., a sound generator or a computer game). 

4. Ask your friends to test it out. 

Ex 8.7: Video denoising Implement the algorithm sketched in Application 8.4.2. Your al- 
gorithm should contain the following steps: 

1. Compute accurate per-pixel flow.

2. Determine which pixels in the reference image have good matches with other frames. 

3. Either average all of the matched pixels or choose the sharpest image, if trying to 
compensate for blur. Don’t forget to use regular single-frame denoising techniques as 
part of your solution (see Section 3.4.4, Section 3.7.3, and Exercise 3.11).

4. Devise a fall-back strategy for areas where you don’t think the flow estimates are
accurate enough.

Ex 8.8: Motion segmentation Write a program to segment an image into separately mov- 
ing regions or to reliably find motion boundaries. 

Use the human-assisted motion segmentation database at http://people.csail.mit.edu/celiu/ 
motionAnnotation/ as some of your test data. 

Ex 8.9: Layered motion estimation Decompose into separate layers (Section 8.5) a video 
sequence of a scene taken with a moving camera: 

1. Find the set of dominant (affine or planar perspective) motions, either by computing 
them in blocks or finding a robust estimate and then iteratively re-fitting outliers. 

2. Determine which pixels go with each motion. 
3. Construct the layers by blending pixels from different frames.

4. (Optional) Add per-pixel residual flows or depths. 

5. (Optional) Refine your estimates using an iterative global optimization technique. 

6. (Optional) Write an interactive renderer to generate in-between frames or view the
scene from different viewpoints (Shade, Gortler, He et al. 1998). 

7. (Optional) Construct an unwrap mosaic from a more complex scene and use this to do 
some video editing (Rav-Acha, Kohli, Fitzgibbon et al. 2008). 

Ex 8.10: Transparent motion and reflection estimation Take a video sequence looking 
through a window (or picture frame) and see if you can remove the reflection in order to 
better see what is inside. 

The steps are described in Section 8.5.2 and by Szeliski, Avidan, and Anandan (2000). 
Alternative approaches can be found in work by Shizawa and Mase (1991), Bergen, Burt, 
Hingorani et al. (1992), Darrell and Simoncelli (1993), Darrell and Pentland (1995), Irani, 
Rousso, and Peleg (1994), Black and Anandan (1996), and Ju, Black, and Jepson (1996).
Figure 9.1 Image stitching: (a) portion of a cylindrical panorama and (b) a spherical
panorama constructed from 54 photographs (Szeliski and Shum 1997) © 1997 ACM; (c) a 
multi-image panorama automatically assembled from an unordered photo collection; a multi- 
image stitch (d) without and (e) with moving object removal (Uyttendaele, Eden, and Szeliski 
2001) © 2001 IEEE. 
Algorithms for aligning images and stitching them into seamless photo-mosaics are among
the oldest and most widely used in computer vision (Milgram 1975; Peleg 1981). Image
stitching algorithms create the high-resolution photo-mosaics used to produce today’s digital
maps and satellite photos. They also come bundled with most digital cameras and can be used 
to create beautiful ultra wide-angle panoramas. 

Image stitching originated in the photogrammetry community, where more manually
intensive methods based on surveyed ground control points or manually registered tie points
have long been used to register aerial photos into large-scale photo-mosaics (Slama 1980). 
One of the key advances in this community was the development of bundle adjustment al- 
gorithms (Section 7.4), which could simultaneously solve for the locations of all of the cam- 
era positions, thus yielding globally consistent solutions (Triggs, McLauchlan, Hartley et al. 
1999). Another recurring problem in creating photo-mosaics is the elimination of visible 
seams, for which a variety of techniques have been developed over the years (Milgram 1975, 
1977; Peleg 1981; Davis 1998; Agarwala, Dontcheva, Agrawala et al. 2004).

In film photography, special cameras were developed in the 1990s to take ultra-wide- 
angle panoramas, often by exposing the film through a vertical slit as the camera rotated on its 
axis (Meehan 1990). In the mid-1990s, image alignment techniques started being applied to 
the construction of wide-angle seamless panoramas from regular hand-held cameras (Mann 
and Picard 1994; Chen 1995; Szeliski 1996). More recent work in this area has addressed 
the need to compute globally consistent alignments (Szeliski and Shum 1997; Sawhney and 
Kumar 1999; Shum and Szeliski 2000), to remove “ghosts” due to parallax and object move- 
ment (Davis 1998; Shum and Szeliski 2000; Uyttendaele, Eden, and Szeliski 2001; Agarwala, 
Dontcheva, Agrawala et al. 2004), and to deal with varying exposures (Mann and Picard 1994; 
Uyttendaele, Eden, and Szeliski 2001; Levin, Zomet, Peleg et al. 2004; Agarwala, Dontcheva, 
Agrawala et al. 2004; Eden, Uyttendaele, and Szeliski 2006; Kopf, Uyttendaele, Deussen et 
al. 2007). 1 These techniques have spawned a large number of commercial stitching products 
(Chen 1995; Sawhney, Kumar, Gendel et al. 1998), of which reviews and comparisons can 
be found on the Web. 2 

While most of the earlier techniques worked by directly minimizing pixel-to-pixel
dissimilarities, more recent algorithms usually extract a sparse set of features and match them
to each other, as described in Chapter 4. Such feature-based approaches to image stitching
have the advantage of being more robust against scene movement and are potentially faster,
if implemented the right way. Their biggest advantage, however, is the ability to “recognize
panoramas”, i.e., to automatically discover the adjacency (overlap) relationships among an
unordered set of images, which makes them ideally suited for fully automated stitching of
panoramas taken by casual users (Brown and Lowe 2007).

1 A collection of some of these papers was compiled by Benosman and Kang (2001) and they are surveyed by
Szeliski (2006a).

2 The Photosynth Web site, http://photosynth.net, allows people to create and upload panoramas for free.

What, then, are the essential problems in image stitching? As with image alignment, we 
must first determine the appropriate mathematical model relating pixel coordinates in one im- 
age to pixel coordinates in another; Section 9.1 reviews the basic models we have studied and 
presents some new motion models related specifically to panoramic image stitching. Next, 
we must somehow estimate the correct alignments relating various pairs (or collections) of 
images. Chapter 4 discussed how distinctive features can be found in each image and then 
efficiently matched to rapidly establish correspondences between pairs of images. Chapter 8 
discussed how direct pixel-to-pixel comparisons combined with gradient descent (and other 
optimization techniques) can also be used to estimate these parameters. When multiple im- 
ages exist in a panorama, bundle adjustment (Section 7.4) can be used to compute a globally 
consistent set of alignments and to efficiently discover which images overlap one another. In 
Section 9.2, we look at how each of these previously developed techniques can be modified 
to take advantage of the imaging setups commonly used to create panoramas. 

Once we have aligned the images, we must choose a final compositing surface for warping 
the aligned images (Section 9.3.1). We also need algorithms to seamlessly cut and blend over- 
lapping images, even in the presence of parallax, lens distortion, scene motion, and exposure 
differences (Sections 9.3.2-9.3.4).


9.1 Motion models 

Before we can register and align images, we need to establish the mathematical relationships 
that map pixel coordinates from one image to another. A variety of such parametric motion 
models are possible, from simple 2D transforms, to planar perspective models, 3D camera 
rotations, lens distortions, and mapping to non-planar (e.g., cylindrical) surfaces. 

We already covered several of these models in Sections 2.1 and 6.1. In particular, we saw
in Section 2.1.5 how the parametric motion describing the deformation of a planar surface
as viewed from different positions can be described with an eight-parameter homography 
(2.71) (Mann and Picard 1994; Szeliski 1996). We also saw how a camera undergoing a pure 
rotation induces a different kind of homography (2.72). 

In this section, we review both of these models and show how they can be applied to dif- 
ferent stitching situations. We also introduce spherical and cylindrical compositing surfaces 
and show how, under favorable circumstances, they can be used to perform alignment using 
pure translations (Section 9.1.6). Deciding which alignment model is most appropriate for a 
given situation or set of data is a model selection problem (Hastie, Tibshirani, and Friedman 
2001; Torr 2002; Bishop 2006; Robert 2007), an important topic we do not cover in this book. 

(a) translation [2 dof] (b) affine [6 dof] (c) perspective [8 dof] (d) 3D rotation [3+ dof] 
Figure 9.2 Two-dimensional motion models and how they can be used for image stitching. 


9.1.1 Planar perspective motion 

The simplest possible motion model to use when aligning images is to simply translate and 
rotate them in 2D (Figure 9.2a). This is exactly the same kind of motion that you would 
use if you had overlapping photographic prints. It is also the kind of technique favored by 
David Hockney to create the collages that he calls joiners (Zelnik-Manor and Perona 2007; 
Nomura, Zhang, and Nayar 2007). Creating such collages, which show visible seams and 
inconsistencies that add to the artistic effect, is popular on Web sites such as Flickr, where they 
more commonly go under the name panography (Section 6.1.2). Translation and rotation are 
also usually adequate motion models to compensate for small camera motions in applications 
such as photo and video stabilization and merging (Exercise 6.1 and Section 8.2.1). 

In Section 6.1.3, we saw how the mapping between two cameras viewing a common plane 
can be described using a 3 × 3 homography (2.71). Consider the matrix $M_{10}$ that arises when 
mapping a pixel in one image to a 3D point and then back onto a second image, 

$$\tilde{x}_1 \sim \tilde{P}_1 \tilde{P}_0^{-1} \tilde{x}_0 = M_{10} \tilde{x}_0. \tag{9.1}$$

When the last row of the $P_0$ matrix is replaced with a plane equation $\hat{n}_0 \cdot p + c_0$ and points are 
assumed to lie on this plane, i.e., their disparity is $d_0 = 0$, we can ignore the last column of 
$M_{10}$ and also its last row, since we do not care about the final z-buffer depth. The resulting 
homography matrix $\tilde{H}_{10}$ (the upper left 3 × 3 sub-matrix of $M_{10}$) describes the mapping 
between pixels in the two images, 

$$\tilde{x}_1 \sim \tilde{H}_{10} \tilde{x}_0. \tag{9.2}$$

This observation formed the basis of some of the earliest automated image stitching al- 
gorithms (Mann and Picard 1994; Szeliski 1994, 1996). Because reliable feature matching 
techniques had not yet been developed, these algorithms used direct pixel value matching, i.e., 
direct parametric motion estimation, as described in Section 8.2 and Equations (6.19-6.20). 

More recent stitching algorithms first extract features and then match them up, often using 
robust techniques such as RANSAC (Section 6.1.4) to compute a good set of inliers. The final 
computation of the homography (9.2), i.e., the solution of the least squares fitting problem 
given pairs of corresponding features, 

$$x_1 = \frac{(1 + h_{00}) x_0 + h_{01} y_0 + h_{02}}{h_{20} x_0 + h_{21} y_0 + 1} \quad \text{and} \quad y_1 = \frac{h_{10} x_0 + (1 + h_{11}) y_0 + h_{12}}{h_{20} x_0 + h_{21} y_0 + 1}, \tag{9.3}$$

uses iterative least squares, as described in Section 6.1.3 and Equations (6.21-6.23). 
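
As a rough illustration of this fitting problem, the sketch below parameterizes the homography by the eight values $(h_{00}, \ldots, h_{21})$, so that $x_1 = ((1 + h_{00}) x_0 + h_{01} y_0 + h_{02}) / (h_{20} x_0 + h_{21} y_0 + 1)$ and similarly for $y_1$, and solves the linear system obtained by multiplying through by the denominator. This is a plain algebraic fit that could serve as a starting point, not the iterative procedure referenced above. The function names are hypothetical, NumPy is assumed, and the correspondences are assumed to be inliers already (for example, after RANSAC).

```python
import numpy as np

def apply_homography(h, pts):
    """Map Nx2 points with h = (h00, h01, h02, h10, h11, h12, h20, h21)."""
    h00, h01, h02, h10, h11, h12, h20, h21 = h
    x, y = pts[:, 0], pts[:, 1]
    d = h20 * x + h21 * y + 1.0
    return np.stack([((1 + h00) * x + h01 * y + h02) / d,
                     (h10 * x + (1 + h11) * y + h12) / d], axis=1)

def fit_homography_linear(p0, p1):
    """Algebraic least squares fit of h from matched Nx2 arrays (N >= 4)."""
    x0, y0 = p0[:, 0], p0[:, 1]
    x1, y1 = p1[:, 0], p1[:, 1]
    zeros, ones = np.zeros_like(x0), np.ones_like(x0)
    # Multiplying x1 = num_x / d by d and moving x0 to the left gives
    #   x1 - x0 = h00 x0 + h01 y0 + h02 - h20 x0 x1 - h21 y0 x1,
    # and likewise for y1; each match contributes two linear equations.
    rows_x = np.stack([x0, y0, ones, zeros, zeros, zeros, -x0 * x1, -y0 * x1], axis=1)
    rows_y = np.stack([zeros, zeros, zeros, x0, y0, ones, -x0 * y1, -y0 * y1], axis=1)
    A = np.concatenate([rows_x, rows_y], axis=0)
    b = np.concatenate([x1 - x0, y1 - y0])
    h, *_ = np.linalg.lstsq(A, b, rcond=None)
    return h
```

The algebraic residual minimized here is not the image-plane error, which is one reason an iterative refinement is normally applied afterwards.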

9.1.2 Application: Whiteboard and document scanning 

The simplest image-stitching application is to stitch together a number of image scans taken 
on a flatbed scanner. Say you have a large map, or a piece of child’s artwork, that is too large 
to fit on your scanner. Simply take multiple scans of the document, making sure to overlap 
the scans by a large enough amount to ensure that there are enough common features. Next, 
take successive pairs of images that you know overlap, extract features, match them up, and 
estimate the 2D rigid transform (2.16), 

$$x_{k+1} = R_k x_k + t_k, \tag{9.4}$$

that best matches the features, using two-point RANSAC, if necessary, to find a good set 
of inliers. Then, on a final compositing surface (aligned with the first scan, for example), 
resample your images (Section 3.6.1) and average them together. Can you see any potential 
problems with this scheme? 

One complication is that a 2D rigid transformation is non-linear in the rotation angle θ, 
so you will have to either use non-linear least squares or constrain R to be orthonormal, as 
described in Section 6.1.3. 
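
One way to keep the rotation orthonormal in 2D is to solve for the angle in closed form from the centered point sets. The sketch below is a minimal version of that idea, assuming equally weighted, already matched inlier points given as lists of (x, y) tuples; the function name is hypothetical and this is not necessarily the procedure of Section 6.1.3.

```python
import math

def fit_rigid_2d(src, dst):
    """Least squares angle and translation such that dst ~ R(theta) src + t."""
    n = len(src)
    cx0 = sum(p[0] for p in src) / n
    cy0 = sum(p[1] for p in src) / n
    cx1 = sum(q[0] for q in dst) / n
    cy1 = sum(q[1] for q in dst) / n
    dot = cross = 0.0
    for (x0, y0), (x1, y1) in zip(src, dst):
        ax, ay = x0 - cx0, y0 - cy0      # centered source point
        bx, by = x1 - cx1, y1 - cy1      # centered target point
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    # Maximizing sum(b . R a) = cos(theta) * dot + sin(theta) * cross.
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    # The translation maps the rotated source centroid onto the target centroid.
    tx = cx1 - (c * cx0 - s * cy0)
    ty = cy1 - (s * cx0 + c * cy0)
    return theta, (tx, ty)
```

Because two point matches already determine the angle and translation, the same function can be called on the minimal samples drawn inside a two-point RANSAC loop.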

A bigger problem lies in the pairwise alignment process. As you align more and more 
pairs, the solution may drift so that it is no longer globally consistent. In this case, a global op- 
timization procedure, as described in Section 9.2, may be required. Such global optimization 
often requires a large system of non-linear equations to be solved, although in some cases, 
such as linearized homographies (Section 9.1.3) or similarity transforms (Section 6.1.2), reg- 
ular least squares may be an option. 

A slightly more complex scenario is when you take multiple overlapping handheld pic- 
tures of a whiteboard or other large planar object (He and Zhang 2005; Zhang and He 2007). 
Here, the natural motion model to use is a homography, although a more complex model that 
estimates the 3D rigid motion relative to the plane (plus the focal length, if unknown), could 
in principle be used. 






Figure 9.3 Pure 3D camera rotation. The form of the homography (mapping) is particularly 
simple and depends only on the 3D rotation matrix and focal lengths. 


9.1.3 Rotational panoramas 


The most typical case for panoramic image stitching is when the camera undergoes a pure ro- 
tation. Think of standing at the rim of the Grand Canyon. Relative to the distant geometry in 
the scene, as you snap away, the camera is undergoing a pure rotation, which is equivalent to 
assuming that all points are very far from the camera, i.e., on the plane at infinity (Figure 9.3). 
Setting $t_0 = t_1 = 0$, we get the simplified 3 × 3 homography 

$$\tilde{H}_{10} = K_1 R_1 R_0^{-1} K_0^{-1} = K_1 R_{10} K_0^{-1}, \tag{9.5}$$

where $K_k = \mathrm{diag}(f_k, f_k, 1)$ is the simplified camera intrinsic matrix (2.59), assuming that 
$c_x = c_y = 0$, i.e., we are indexing the pixels starting from the optical center (Szeliski 1996). 
This can also be re-written as 

$$\begin{bmatrix} x_1 \\ y_1 \\ 1 \end{bmatrix} \sim \begin{bmatrix} f_1 & & \\ & f_1 & \\ & & 1 \end{bmatrix} R_{10} \begin{bmatrix} f_0^{-1} & & \\ & f_0^{-1} & \\ & & 1 \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \\ 1 \end{bmatrix} \tag{9.6}$$

or 

$$\begin{bmatrix} x_1 \\ y_1 \\ f_1 \end{bmatrix} \sim R_{10} \begin{bmatrix} x_0 \\ y_0 \\ f_0 \end{bmatrix}, \tag{9.7}$$

which reveals the simplicity of the mapping equations and makes all of the motion parameters 
explicit. Thus, instead of the general eight-parameter homography relating a pair of images, 
we get the three-, four-, or five-parameter 3D rotation motion models corresponding to the 
cases where the focal length $f$ is known, fixed, or variable (Szeliski and Shum 1997).³ Estimating 
the 3D rotation matrix (and, optionally, focal length) associated with each image is 
intrinsically more stable than estimating a homography with a full eight degrees of freedom, 
which makes this the method of choice for large-scale image stitching algorithms (Szeliski 
and Shum 1997; Shum and Szeliski 2000; Brown and Lowe 2007). 

³ An initial estimate of the focal lengths can be obtained using the intrinsic calibration techniques described in 
Section 6.3.4 or from EXIF tags. 
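
The rotational homography (9.5) is easy to check numerically. The sketch below, with hypothetical function names and NumPy assumed, builds $K_1 R_{10} K_0^{-1}$ with $K_k = \mathrm{diag}(f_k, f_k, 1)$ and compares it with the equivalent form in which the focal lengths are folded into the homogeneous vectors, $(x_1, y_1, f_1) \sim R_{10} (x_0, y_0, f_0)$. Pixel coordinates are assumed to be measured from the optical center.

```python
import numpy as np

def rotation_homography(f0, f1, R10):
    """H10 = K1 R10 K0^{-1} with K_k = diag(f_k, f_k, 1)."""
    K1 = np.diag([f1, f1, 1.0])
    K0_inv = np.diag([1.0 / f0, 1.0 / f0, 1.0])
    return K1 @ R10 @ K0_inv

def map_pixel(H, x0, y0):
    """Apply a 3x3 homography to a pixel and dehomogenize."""
    x, y, w = H @ np.array([x0, y0, 1.0])
    return x / w, y / w

def map_pixel_ray_form(f0, f1, R10, x0, y0):
    """Same mapping via (x1, y1, f1) ~ R10 (x0, y0, f0)."""
    X, Y, Z = R10 @ np.array([x0, y0, f0])
    # Rescale so that the third component equals f1.
    return f1 * X / Z, f1 * Y / Z
```

Both routes give the same pixel, which makes the three, four, or five free parameters (rotation plus zero, one, or two focal lengths) explicit.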

Given this representation, how do we update the rotation matrices to best align two overlapping 
images? Given a current estimate for the homography $\tilde{H}_{10}$ in (9.5), the best way to 
update $R_{10}$ is to prepend an incremental rotation matrix $R(\omega)$ to the current estimate $R_{10}$ 
(Szeliski and Shum 1997; Shum and Szeliski 2000), 

$$\tilde{H}(\omega) = K_1 R(\omega) R_{10} K_0^{-1} = [K_1 R(\omega) K_1^{-1}][K_1 R_{10} K_0^{-1}] = D \tilde{H}_{10}. \tag{9.8}$$

Note that here we have written the update rule in the compositional form, where the incremental 
update $D$ is prepended to the current homography $\tilde{H}_{10}$. Using the small-angle 
approximation to $R(\omega)$ given in (2.35), we can write the incremental update matrix as 

$$D = K_1 R(\omega) K_1^{-1} \approx K_1 (I + [\omega]_\times) K_1^{-1} = \begin{bmatrix} 1 & -\omega_z & f_1 \omega_y \\ \omega_z & 1 & -f_1 \omega_x \\ -\omega_y / f_1 & \omega_x / f_1 & 1 \end{bmatrix}. \tag{9.9}$$

Notice how there is now a nice one-to-one correspondence between the entries in the $D$ 
matrix and the $h_{00}, \ldots, h_{21}$ parameters used in Table 6.1 and Equation (6.19), i.e., 

$$(h_{00}, h_{01}, h_{02}, h_{10}, h_{11}, h_{12}, h_{20}, h_{21}) = (0, -\omega_z, f_1 \omega_y, \omega_z, 0, -f_1 \omega_x, -\omega_y / f_1, \omega_x / f_1). \tag{9.10}$$

We can therefore apply the chain rule to Equations (6.24 and 9.10) to obtain 

$$\begin{bmatrix} \hat{x}' - x \\ \hat{y}' - y \end{bmatrix} = \begin{bmatrix} -xy/f_1 & f_1 + x^2/f_1 & -y \\ -(f_1 + y^2/f_1) & xy/f_1 & x \end{bmatrix} \begin{bmatrix} \omega_x \\ \omega_y \\ \omega_z \end{bmatrix}, \tag{9.11}$$

which gives us the linearized update equations needed to estimate $\omega = (\omega_x, \omega_y, \omega_z)$.⁴ Notice 
that this update rule depends on the focal length $f_1$ of the target view and is independent 
of the focal length $f_0$ of the template view. This is because the compositional algorithm 
essentially makes small perturbations to the target. Once the incremental rotation vector $\omega$ 
has been computed, the $R_1$ rotation matrix can be updated using $R_1 \leftarrow R(\omega) R_1$. 

The formulas for updating the focal length estimates are a little more involved and are 
given in (Shum and Szeliski 2000). We will not repeat them here, since an alternative up- 
date rule, based on minimizing the difference between back-projected 3D rays, is given in 
Section 9.2.1. Figure 9.4 shows the alignment of four images under the 3D rotation motion 
model. 
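
To make the linearized update concrete, the sketch below stacks the 2 × 3 Jacobian obtained by substituting the small-angle entries of $D$ into the homography (row for $x$: $(-xy/f_1,\; f_1 + x^2/f_1,\; -y)$; row for $y$: $(-(f_1 + y^2/f_1),\; xy/f_1,\; x)$) and solves for $\omega$ by linear least squares. It works on point correspondences rather than pixel intensities, which is a simplification made here for brevity; the function names are hypothetical and NumPy is assumed.

```python
import numpy as np

def rotation_jacobian(x, y, f1):
    """2x3 Jacobian of the displacement of pixel (x, y) with respect to omega."""
    return np.array([[-x * y / f1, f1 + x * x / f1, -y],
                     [-(f1 + y * y / f1), x * y / f1, x]])

def rodrigues(omega):
    """Rotation matrix R(omega) for a rotation vector (Rodrigues' formula)."""
    wx, wy, wz = omega
    W = np.array([[0.0, -wz, wy], [wz, 0.0, -wx], [-wy, wx, 0.0]])
    theta = np.linalg.norm(omega)
    if theta < 1e-12:
        return np.eye(3) + W          # small-angle limit
    return (np.eye(3) + np.sin(theta) / theta * W
            + (1.0 - np.cos(theta)) / theta**2 * (W @ W))

def estimate_omega(pts, pts_observed, f1):
    """One linearized least squares step for omega from matched Nx2 arrays."""
    J = np.concatenate([rotation_jacobian(x, y, f1) for x, y in pts], axis=0)
    r = (pts_observed - pts).reshape(-1)   # (dx0, dy0, dx1, dy1, ...)
    omega, *_ = np.linalg.lstsq(J, r, rcond=None)
    return omega
```

After each step the rotation would be updated as `R1 = rodrigues(omega) @ R1`, which keeps `R1` a valid rotation matrix; for large misalignments several such steps would be needed.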

⁴ This is the same as the rotational component of instantaneous rigid flow (Bergen, Anandan, Hanna et al. 1992) 
and the update equations given by Szeliski and Shum (1997) and Shum and Szeliski (2000). 

Figure 9.4 Four images taken with a hand-held camera registered using a 3D rotation motion 
model (Szeliski and Shum 1997) © 1997 ACM. Notice how the homographies, rather 
than being arbitrary, have a well-defined keystone shape whose width increases away from 
the origin. 


9.1.4 Gap closing 

The techniques presented in this section can be used to estimate a series of rotation matrices 
and focal lengths, which can be chained together to create large panoramas. Unfortunately, 
because of accumulated errors, this approach will rarely produce a closed 360° panorama. 
Instead, there will invariably be either a gap or an overlap (Figure 9.5). 

We can solve this problem by matching the first image in the sequence with the last one. 
The difference between the two rotation matrix estimates associated with the repeated first 
image indicates the amount of misregistration. This error can be distributed evenly across the whole 
sequence by taking the quotient of the two quaternions associated with these rotations and 
dividing this “error quaternion” by the number of images in the sequence (assuming relatively 
constant inter-frame rotations). We can also update the estimated focal length based on the 
amount of misregistration. To do this, we first convert the error quaternion into a gap angle 
$\theta_g$, and then update the focal length using the equation $f' = f(1 - \theta_g / 360°)$. 

Figure 9.5a shows the end of a registered image sequence and the first image. There is a 
big gap between the last image and the first, which are in fact the same image. The gap is 
32° because the wrong estimate of focal length ($f = 510$) was used. Figure 9.5b shows the 
registration after closing the gap with the correct focal length ($f = 468$). Notice that both 
mosaics show very little visual misregistration (except at the gap), yet Figure 9.5a has been 
computed using a focal length that has 9% error. Related approaches have been developed by 
Hartley (1994b), McMillan and Bishop (1995), Stein (1995), and Kang and Weiss (1997) to 
solve the focal length estimation problem using pure panning motion and cylindrical images. 

Figure 9.5 Gap closing (Szeliski and Shum 1997) © 1997 ACM: (a) A gap is visible when 
the focal length is wrong ($f = 510$). (b) No gap is visible for the correct focal length 
($f = 468$). 
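
The focal length update can be checked against the numbers quoted for Figure 9.5. The sketch below is a minimal version, assuming that a gap is counted as a positive angle (and an overlap as a negative one) and that rotations are given as 3 × 3 nested lists; the function names are hypothetical. The angle is recovered from the trace of the relative rotation instead of from a quaternion, which gives the same magnitude but no sign.

```python
import math

def rotation_angle_between_deg(R_a, R_b):
    """Unsigned angle (degrees) of the relative rotation R_b R_a^T."""
    # trace(R_b R_a^T) equals the sum of elementwise products.
    trace = sum(R_a[i][j] * R_b[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def update_focal_length(f, gap_deg):
    """f' = f (1 - theta_g / 360 deg)."""
    return f * (1.0 - gap_deg / 360.0)

def per_image_correction_deg(gap_deg, num_images):
    """Evenly distributed correction, assuming constant inter-frame rotation."""
    return gap_deg / num_images

f_new = update_focal_length(510.0, 32.0)   # about 464.7
```

With $f = 510$ and a 32° gap, a single update gives $510 \times (1 - 32/360) \approx 464.7$, which is within about 1% of the focal length of 468 reported for the closed panorama.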


Unfortunately, this particular gap-closing heuristic only works for the kind of “one-dimensional” 
panorama where the camera is continuously turning in the same direction. In Section 9.2, we 
describe a different approach to removing gaps and overlaps that works for arbitrary camera 
motions. 


9.1.5 Application: Video summarization and compression 

An interesting application of image stitching is the ability to summarize and compress videos 
taken with a panning camera. This application was first suggested by Teodosio and Bender 
(1993), who called their mosaic-based summaries salient stills. These ideas were then extended 
by Irani, Hsu, and Anandan (1995), Kumar, Anandan, Irani et al. (1995), and Irani and 
Anandan (1998) to additional applications, such as video compression and video indexing. 
While these early approaches used affine motion models and were therefore restricted to long 
focal lengths, the techniques were generalized by Lee, ge Chen, lung Bruce Lin et al. (1997) 
to full eight-parameter homographies and incorporated into the MPEG-4 video compression 
standard, where the stitched background layers were called video sprites (Figure 9.6). 

While video stitching is in many ways a straightforward generalization of multiple-image 
stitching (Steedly, Pal, and Szeliski 2005; Baudisch, Tan, Steedly et al. 2006), the potential 
presence of large amounts of independent motion, camera zoom, and the desire to visualize 
dynamic events impose additional challenges. For example, moving foreground objects can 
often be removed using median filtering. Alternatively, foreground objects can be extracted 
into a separate layer (Sawhney and Ayer 1996) and later composited back into the stitched 
panoramas, sometimes as multiple instances to give the impression of a “Chronophotograph” 
(Massey and Bender 1996) and sometimes as video overlays (Irani and Anandan 1998). 
Videos can also be used to create animated panoramic video textures (Section 13.5.2), in 
which different portions of a panoramic scene are animated with independently moving video 
loops (Agarwala, Zheng, Pal et al. 2005; Rav-Acha, Pritch, Lischinski et al. 2005), or to shine 
“video flashlights” onto a composite mosaic of a scene (Sawhney, Arpa, Kumar et al. 2002). 

Figure 9.6 Video stitching the background scene to create a single sprite image that can be 
transmitted and used to re-create the background in each frame (Lee, ge Chen, lung Bruce Lin 
et al. 1997) © 1997 IEEE. 

Video can also provide an interesting source of content for creating panoramas taken from 
moving cameras. While this invalidates the usual assumption of a single point of view (opti- 
cal center), interesting results can still be obtained. For example, the VideoBrush system of 
Sawhney, Kumar, Gendel et al. (1998) uses thin strips taken from the center of the image to 
create a panorama taken from a horizontally moving camera. This idea can be generalized 
to other camera motions and compositing surfaces using the concept of mosaics on adaptive 
manifolds (Peleg, Rousso, Rav-Acha et al. 2000), and also used to generate panoramic 
stereograms (Peleg, Ben-Ezra, and Pritch 2001). Related ideas have been used to create 
panoramic matte paintings for multi-plane cel animation (Wood, Finkelstein, Hughes et al. 
1997), for creating stitched images of scenes with parallax (Kumar, Anandan, Irani et al. 
1995), and as 3D representations of more complex scenes using multiple-center-of-projection 
images (Rademacher and Bishop 1998) and multi-perspective panoramas (Roman, Garg, and 
Levoy 2004; Roman and Lensch 2006; Agarwala, Agrawala, Cohen et al. 2006). 

Another interesting variant on video-based panoramas is concentric mosaics (Section 13.3.3) 
(Shum and He 1999). Here, rather than trying to produce a single panoramic image, the complete 
original video is kept and used to re-synthesize views (from different camera origins) 
using ray remapping (light field rendering), thus endowing the panorama with a sense of 3D 
depth. The same data set can also be used to explicitly reconstruct the depth using multi-baseline 
stereo (Peleg, Ben-Ezra, and Pritch 2001; Li, Shum, Tang et al. 2004; Zheng, Kang, 
Cohen et al. 2007). 

Figure 9.7 Projection from 3D to (a) cylindrical and (b) spherical coordinates. 

9.1.6 Cylindrical and spherical coordinates 

An alternative to using homographies or 3D motions to align images is to first warp the images 
into cylindrical coordinates and then use a pure translational model to align them (Chen 1995; 
Szeliski 1996). Unfortunately, this only works if the images are all taken with a level camera 
or with a known tilt angle. 

Assume for now that the camera is in its canonical position, i.e., its rotation matrix is the 
identity, $R = I$, so that the optical axis is aligned with the z axis and the y axis is aligned 
vertically. The 3D ray corresponding to an $(x, y)$ pixel is therefore $(x, y, f)$. 

We wish to project this image onto a cylindrical surface of unit radius (Szeliski 1996). 
Points on this surface are parameterized by an angle $\theta$ and a height $h$, with the 3D cylindrical 
coordinates corresponding to $(\theta, h)$ given by 

$$(\sin\theta, h, \cos\theta) \propto (x, y, f), \tag{9.12}$$

as shown in Figure 9.7a. From this correspondence, we can compute the formula for the 
warped or mapped coordinates (Szeliski and Shum 1997), 

$$x' = s\theta = s \tan^{-1} \frac{x}{f}, \tag{9.13}$$

$$y' = sh = s \frac{y}{\sqrt{x^2 + f^2}}, \tag{9.14}$$

where $s$ is an arbitrary scaling factor (sometimes called the radius of the cylinder) that can be 
set to $s = f$ to minimize the distortion (scaling) near the center of the image.⁵ The inverse of 
this mapping equation is given by 

$$x = f \tan\theta = f \tan \frac{x'}{s}, \tag{9.15}$$

$$y = h \sqrt{x^2 + f^2} = \frac{y'}{s} f \sqrt{1 + \tan^2 (x'/s)} = f \frac{y'}{s} \sec \frac{x'}{s}. \tag{9.16}$$

⁵ The scale can also be set to a larger or smaller value for the final compositing surface, depending on the desired 
output panorama resolution; see Section 9.3. 
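
The cylindrical warp and its inverse translate directly into code. The following is a minimal sketch with hypothetical function names, assuming pixel coordinates measured from the optical center and angles in radians: the forward map is $x' = s \tan^{-1}(x/f)$, $y' = s\, y / \sqrt{x^2 + f^2}$, and the inverse is $x = f \tan(x'/s)$, $y = f (y'/s) \sec(x'/s)$.

```python
import math

def to_cylindrical(x, y, f, s):
    """Forward warp from image coordinates (x, y) to cylindrical (x', y')."""
    theta = math.atan2(x, f)             # angle around the cylinder axis
    h = y / math.hypot(x, f)             # height on the unit-radius cylinder
    return s * theta, s * h

def from_cylindrical(xp, yp, f, s):
    """Inverse warp from cylindrical (x', y') back to image coordinates."""
    theta = xp / s
    return f * math.tan(theta), f * (yp / s) / math.cos(theta)
```

When warping an image, the inverse map is usually the one evaluated per output pixel, so that each cylindrical pixel looks up (and interpolates) its source location in the original image.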


Images can also be projected onto a spherical surface (Szeliski and Shum 1997), which 
is useful if the final panorama includes a full sphere or hemisphere of views, instead of just 
a cylindrical strip. In this case, the sphere is parameterized by two angles $(\theta, \phi)$, with 3D 
spherical coordinates given by 

$$(\sin\theta \cos\phi, \sin\phi, \cos\theta \cos\phi) \propto (x, y, f), \tag{9.17}$$

as shown in Figure 9.7b.⁶ The correspondence between coordinates is now given by (Szeliski 
and Shum 1997): 

$$x' = s\theta = s \tan^{-1} \frac{x}{f}, \tag{9.18}$$

$$y' = s\phi = s \tan^{-1} \frac{y}{\sqrt{x^2 + f^2}}, \tag{9.19}$$

while the inverse is given by 

$$x = f \tan\theta = f \tan \frac{x'}{s}, \tag{9.20}$$

$$y = \sqrt{x^2 + f^2} \tan\phi = \tan \frac{y'}{s} f \sqrt{1 + \tan^2 (x'/s)} = f \tan \frac{y'}{s} \sec \frac{x'}{s}. \tag{9.21}$$

Note that it may be simpler to generate a scaled $(x, y, z)$ direction from Equation (9.17) 
followed by a perspective division by $z$ and a scaling by $f$. 
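
Both routes to the inverse spherical mapping can be compared in a few lines. This sketch uses hypothetical function names and assumes angles in radians and pixel coordinates measured from the optical center; the second inverse builds the direction $(\sin\theta \cos\phi, \sin\phi, \cos\theta \cos\phi)$, divides by its $z$ component, and scales by $f$.

```python
import math

def to_spherical(x, y, f, s):
    """Forward warp from image coordinates (x, y) to spherical (x', y')."""
    theta = math.atan2(x, f)
    phi = math.atan2(y, math.hypot(x, f))
    return s * theta, s * phi

def from_spherical(xp, yp, f, s):
    """Closed-form inverse: x = f tan(theta), y = f tan(phi) sec(theta)."""
    theta, phi = xp / s, yp / s
    return f * math.tan(theta), f * math.tan(phi) / math.cos(theta)

def from_spherical_via_ray(xp, yp, f, s):
    """Inverse via a 3D direction, perspective division by z, scaling by f."""
    theta, phi = xp / s, yp / s
    X = math.sin(theta) * math.cos(phi)
    Y = math.sin(phi)
    Z = math.cos(theta) * math.cos(phi)
    return f * X / Z, f * Y / Z
```

The ray-based version generalizes more readily, since a known camera rotation can be applied to the direction before the perspective division.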

Cylindrical image stitching algorithms are most commonly used when the camera is 
known to be level and only rotating around its vertical axis (Chen 1995). Under these conditions, 
images at different rotations are related by a pure horizontal translation.⁷ This makes 
it attractive as an initial class project in an introductory computer vision course, since the 
full complexity of the perspective alignment algorithm (Sections 6.1, 8.2, and 9.1.3) can be 
avoided. Figure 9.8 shows how two cylindrically warped images from a leveled rotational 
panorama are related by a pure translation (Szeliski and Shum 1997). 
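
The claim that a level pan becomes a horizontal shift in cylindrical coordinates can be verified numerically. The sketch below, with a hypothetical function name, projects a 3D point into a camera panned by a given angle about the vertical axis and then applies the cylindrical warp; the sign convention for the pan angle is an assumption of this example.

```python
import math

def project_to_cylinder(X, Y, Z, pan, f, s):
    """Cylindrical coordinates of the 3D point (X, Y, Z) seen by a level
    camera panned by `pan` radians about the vertical (y) axis."""
    # Express the point in the panned camera frame (rotation about y).
    c, sn = math.cos(pan), math.sin(pan)
    Xc = c * X - sn * Z
    Zc = sn * X + c * Z
    # Perspective projection, then the cylindrical warp.
    x, y = f * Xc / Zc, f * Y / Zc
    return s * math.atan2(x, f), s * y / math.hypot(x, f)

f = s = 500.0
a = project_to_cylinder(1.0, 0.5, 4.0, 0.0, f, s)
b = project_to_cylinder(1.0, 0.5, 4.0, 0.2, f, s)
shift = (a[0] - b[0], a[1] - b[1])   # approximately (s * 0.2, 0)
```

Every point shifts by the same amount $s \cdot \text{pan}$ horizontally and not at all vertically, which is why a one- or two-parameter translational search is enough to align such images.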

Professional panoramic photographers often use pan-tilt heads that make it easy to control 
the tilt and to stop at specific detents in the rotation angle. Motorized rotation heads are also 
sometimes used for the acquisition of larger panoramas (Kopf, Uyttendaele, Deussen et al. 
2007).⁸ Not only do they ensure a uniform coverage of the visual field with a desired amount 
of image overlap but they also make it possible to stitch the images using cylindrical or 
spherical coordinates and pure translations. In this case, pixel coordinates $(x, y, f)$ must first 
be rotated using the known tilt and panning angles before being projected into cylindrical 
or spherical coordinates (Chen 1995). Having a roughly known panning angle also makes it 
easier to compute the alignment, since the rough relative positioning of all the input images is 
known ahead of time, enabling a reduced search range for alignment. Figure 9.9 shows a full 
3D rotational panorama unwrapped onto the surface of a sphere (Szeliski and Shum 1997). 

⁶ Note that these are not the usual spherical coordinates, first presented in Equation (2.8). Here, the y axis points 
at the north pole instead of the z axis, since we are used to viewing images taken horizontally, i.e., with the y axis 
pointing in the direction of the gravity vector. 

⁷ Small vertical tilts can sometimes be compensated for with vertical translations. 

⁸ See also http://gigapan.org. 

Figure 9.8 A cylindrical panorama (Szeliski and Shum 1997) © 1997 ACM: (a) two cylindrically 
warped images related by a horizontal translation; (b) part of a cylindrical panorama 
composited from a sequence of images. 

Figure 9.9 A spherical panorama constructed from 54 photographs (Szeliski and Shum 
1997) © 1997 ACM. 

One final coordinate mapping worth mentioning is the polar mapping, where the north 
pole lies along the optical axis rather than the vertical axis, 

$$(\cos\theta \sin\phi, \sin\theta \sin\phi, \cos\phi) = s\,(x, y, z). \tag{9.22}$$

In this case, the mapping equations become 

$$x' = s\phi \cos\theta = s \frac{x}{r} \tan^{-1} \frac{r}{z}, \tag{9.23}$$

$$y' = s\phi \sin\theta = s \frac{y}{r} \tan^{-1} \frac{r}{z}, \tag{9.24}$$

where $r = \sqrt{x^2 + y^2}$ is the radial distance in the $(x, y)$ plane and $s\phi$ plays a similar role 
in the $(x', y')$ plane. This mapping provides an attractive visualization surface for certain 
kinds of wide-angle panoramas and is also a good model for the distortion induced by fisheye 
lenses, as discussed in Section 2.1.6. Note how for small values of $(x, y)$, the mapping 
equations reduce to $x' \approx s x / z$, which suggests that $s$ plays a role similar to the focal length 
$f$. 
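
The polar mapping and its small-angle behavior can be sketched as follows, using a hypothetical function name. The mapped radius is $s\phi$, where $\phi = \tan^{-1}(r/z)$ is the angle away from the optical axis and $r = \sqrt{x^2 + y^2}$; the point on the optical axis is handled separately to avoid dividing by zero.

```python
import math

def to_polar(x, y, z, s):
    """Polar mapping: x' = s (x/r) atan(r/z), y' = s (y/r) atan(r/z)."""
    r = math.hypot(x, y)
    if r == 0.0:
        return 0.0, 0.0              # point on the optical axis
    phi = math.atan2(r, z)           # angle away from the optical axis
    return s * phi * x / r, s * phi * y / r
```

For rays close to the optical axis, $\tan^{-1}(r/z) \approx r/z$, so the output approaches $(s x / z, s y / z)$, the ordinary perspective projection with focal length $s$.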

9.2 Global alignment 

So far, we have discussed how to register pairs of images using a variety of motion models. In 
most applications, we are given more than a single pair of images to register. The goal is then 
to find a globally consistent set of alignment parameters that minimize the mis-registration 
between all pairs of images (Szeliski and Shum 1997; Shum and Szeliski 2000; Sawhney and 
Kumar 1999; Coorg and Teller 2000). 

In this section, we extend the pairwise matching criteria (6.2, 8.1, and 8.50) to a global 
energy function that involves all of the per-image pose parameters (Section 9.2.1). Once 
we have computed the global alignment, we often need to perform local adjustments, such 
as parallax removal, to reduce double images and blurring due to local mis-registrations 
(Section 9.2.2). Finally, if we are given an unordered set of images to register, we need to 
discover which images go together to form one or more panoramas. This process of panorama 
recognition is described in Section 9.2.3. 

9.2.1 Bundle adjustment 

One way to register a large number of images is to add new images to the panorama one 
at a time, aligning the most recent image with the previous ones already in the collection 
(Szeliski and Shum 1997) and discovering, if necessary, which images it overlaps (Sawhney 
and Kumar 1999). In the case of 360° panoramas, accumulated error may lead to the presence 
of a gap (or excessive overlap) between the two ends of the panorama, which can be fixed 
by stretching the alignment of all the images using a process called gap closing (Szeliski and 
Shum 1997). However, a better alternative is to simultaneously align all the images using a 
least-squares framework to correctly distribute any mis-registration errors. 

The process of simultaneously adjusting pose parameters for a large collection of overlapping 
images is called bundle adjustment in the photogrammetry community (Triggs, McLauchlan, 
Hartley et al. 1999). In computer vision, it was first applied to the general structure from 
motion problem (Szeliski and Kang 1994) and then later specialized for panoramic image 
stitching (Shum and Szeliski 2000; Sawhney and Kumar 1999; Coorg and Teller 2000). 

In this section, we formulate the problem of global alignment using a feature-based ap- 
proach, since this results in a simpler system. An equivalent direct approach can be obtained 
either by dividing images into patches and creating a virtual feature correspondence for each 
one (as discussed in Section 9.2.4 and by Shum and Szeliski (2000)) or by replacing the 
per-feature error metrics with per-pixel metrics. 

Consider the feature-based alignment problem given in Equation (6.2), i.e., 

$$E_{\text{pairwise-LS}} = \sum_i \| r_i \|^2 = \sum_i \| \tilde{x}'_i(x_i; p) - \hat{x}'_i \|^2. \tag{9.25}$$

For multi-image alignment, instead of having a single collection of pairwise feature correspondences, 
$\{(x_i, \hat{x}'_i)\}$, we have a collection of $n$ features, with the location of the $i$th feature 
point in the $j$th image denoted by $x_{ij}$ and its scalar confidence (i.e., inverse variance) denoted 
by $c_{ij}$.⁹ Each image also has some associated pose parameters. 

In this section, we assume that this pose consists of a rotation matrix $R_j$ and a focal 
length $f_j$, although formulations in terms of homographies are also possible (Szeliski and 
Shum 1997; Sawhney and Kumar 1999). The equation mapping a 3D point $x_i$ into a point 
$x_{ij}$ in frame $j$ can be re-written from Equations (2.68) and (9.5) as 

$$\tilde{x}_{ij} \sim K_j R_j x_i \quad \text{and} \quad x_i \sim R_j^{-1} K_j^{-1} \tilde{x}_{ij}, \tag{9.26}$$

where $K_j = \mathrm{diag}(f_j, f_j, 1)$ is the simplified form of the calibration matrix. The motion 
mapping a point $x_{ij}$ from frame $j$ into a point $x_{ik}$ in frame $k$ is similarly given by 

$$\tilde{x}_{ik} \sim \tilde{H}_{kj} \tilde{x}_{ij} = K_k R_k R_j^{-1} K_j^{-1} \tilde{x}_{ij}. \tag{9.27}$$

Given an initial set of $\{(R_j, f_j)\}$ estimates obtained from chaining pairwise alignments, how 
do we refine these estimates? 

One approach is to directly extend the pairwise energy $E_{\text{pairwise-LS}}$ (9.25) to a multiview 
formulation, 

$$E_{\text{all-pairs-2D}} = \sum_i \sum_{jk} c_{ij} c_{ik} \| \tilde{x}_{ik}(\hat{x}_{ij}; R_j, f_j, R_k, f_k) - \hat{x}_{ik} \|^2, \tag{9.28}$$

where the $\tilde{x}_{ik}$ function is the predicted location of feature $i$ in frame $k$ given by (9.27), 
$\hat{x}_{ij}$ is the observed location, and the “2D” in the subscript indicates that an image-plane 
error is being minimized (Shum and Szeliski 2000). Note that since $\tilde{x}_{ik}$ depends on the $\hat{x}_{ij}$ 
observed value, we actually have an errors-in-variable problem, which in principle requires 
more sophisticated techniques than least squares to solve (Van Huffel and Lemmerling 2002; 
Matei and Meer 2006). However, in practice, if we have enough features, we can directly 
minimize the above quantity using regular non-linear least squares and obtain an accurate 
multi-frame alignment. 

While this approach works well in practice, it suffers from two potential disadvantages. 
First, since a summation is taken over all pairs with corresponding features, features that are 
observed many times are overweighted in the final solution. (In effect, a feature observed $m$ 
times gets counted $\binom{m}{2}$ times instead of $m$ times.) Second, the derivatives of $\tilde{x}_{ik}$ with respect 
to the $\{(R_j, f_j)\}$ are a little cumbersome, although using the incremental correction to $R_j$ 
introduced in Section 9.1.3 makes this more tractable. 

⁹ Features that are not seen in image $j$ have $c_{ij} = 0$. We can also use 2 × 2 inverse covariance matrices $\Sigma_{ij}^{-1}$ in 
place of $c_{ij}$, as shown in Equation (6.11). 
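
A direct evaluation of the all-pairs image-plane energy might look like the sketch below. It predicts the location of a feature in frame $k$ from its observation in frame $j$ using $K_k R_k R_j^{-1} K_j^{-1}$ and accumulates confidence-weighted squared residuals. The data layout (`obs[i][j]` holding an (x, y) tuple or `None`) and the function names are assumptions of this example, as is the choice to sum over unordered pairs $j < k$, which matches the count of $\binom{m}{2}$ terms for a feature seen $m$ times. NumPy is assumed.

```python
import numpy as np

def predict_in_frame_k(x_ij, R_j, f_j, R_k, f_k):
    """Location in frame k predicted from the pixel x_ij observed in frame j."""
    K_k = np.diag([f_k, f_k, 1.0])
    K_j_inv = np.diag([1.0 / f_j, 1.0 / f_j, 1.0])
    # R_j^{-1} = R_j^T for a rotation matrix.
    p = K_k @ R_k @ R_j.T @ K_j_inv @ np.array([x_ij[0], x_ij[1], 1.0])
    return p[:2] / p[2]

def energy_all_pairs_2d(obs, conf, R, f):
    """Sum of c_ij c_ik ||predicted - observed||^2 over features and pairs j < k."""
    E = 0.0
    for i, row in enumerate(obs):
        for j in range(len(row)):
            for k in range(j + 1, len(row)):
                if row[j] is None or row[k] is None:
                    continue              # feature i not seen in both images
                pred = predict_in_frame_k(row[j], R[j], f[j], R[k], f[k])
                r = pred - np.asarray(row[k], dtype=float)
                E += conf[i][j] * conf[i][k] * float(r @ r)
    return E
```

In an actual optimizer the residual vectors, rather than their summed squares, would be handed to a non-linear least squares routine together with their derivatives.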

An alternative way to formulate the optimization is to use true bundle adjustment, i.e., to 
solve not only for the pose parameters $\{(R_j, f_j)\}$ but also for the 3D point positions $\{x_i\}$, 

$$E_{\text{BA-2D}} = \sum_i \sum_j c_{ij} \| \tilde{x}_{ij}(x_i; R_j, f_j) - \hat{x}_{ij} \|^2, \tag{9.29}$$

where $\tilde{x}_{ij}(x_i; R_j, f_j)$ is given by (9.26). The disadvantage of full bundle adjustment is that 
there are more variables to solve for, so each iteration and also the overall convergence may 
be slower. (Imagine how the 3D points need to “shift” each time some rotation matrices are 
updated.) However, the computational complexity of each linearized Gauss-Newton step can 
be reduced using sparse matrix techniques (Section 7.4.1) (Szeliski and Kang 1994; Triggs, 
McLauchlan, Hartley et al. 1999; Hartley and Zisserman 2004). 

An alternative formulation is to minimize the error in 3D projected ray directions (Shum 
and Szeliski 2000), i.e., 

$$E_{\text{BA-3D}} = \sum_i \sum_j c_{ij} \| \tilde{x}_i(\hat{x}_{ij}; R_j, f_j) - x_i \|^2, \tag{9.30}$$

where $\tilde{x}_i(\hat{x}_{ij}; R_j, f_j)$ is given by the second half of (9.26). This has no particular advantage 
over (9.29). In fact, since errors are being minimized in 3D ray space, there is a bias towards 
estimating longer focal lengths, since the angles between rays become smaller as $f$ increases. 

However, if we eliminate the 3D rays $x_i$, we can derive a pairwise energy formulated in 
3D ray space (Shum and Szeliski 2000), 

$$E_{\text{all-pairs-3D}} = \sum_i \sum_{jk} c_{ij} c_{ik} \| \tilde{x}_i(\hat{x}_{ij}; R_j, f_j) - \tilde{x}_i(\hat{x}_{ik}; R_k, f_k) \|^2. \tag{9.31}$$

This results in the simplest set of update equations (Shum and Szeliski 2000), since the $f_k$ can 
be folded into the creation of the homogeneous coordinate vector as in Equation (9.7). Thus, 
even though this formula over-weights features that occur more frequently, it is the method 
used by Shum and Szeliski (2000) and Brown, Szeliski, and Winder (2005). In order to reduce 
the bias towards longer focal lengths, we multiply each residual (3D error) by $\sqrt{f_j f_k}$, which 
is similar to projecting the 3D rays into a “virtual camera” of intermediate focal length. 
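
The pairwise 3D ray residual, including the $\sqrt{f_j f_k}$ scaling, can be sketched as follows. Each pixel is back-projected as $R^{-1} K^{-1} (x, y, 1) \sim R^T (x, y, f)$; normalizing the rays to unit length before differencing is an assumption of this example, and the function names are hypothetical. NumPy is assumed.

```python
import numpy as np

def back_project(x, R, f):
    """Unit 3D ray for the pixel x = (x, y) in a camera with rotation R and focal length f."""
    ray = R.T @ np.array([x[0], x[1], f])   # focal length folded into the vector
    return ray / np.linalg.norm(ray)

def ray_residual(x_ij, R_j, f_j, x_ik, R_k, f_k):
    """Pairwise 3D ray error, scaled by sqrt(f_j f_k) to offset the bias
    towards longer focal lengths."""
    diff = back_project(x_ij, R_j, f_j) - back_project(x_ik, R_k, f_k)
    return np.sqrt(f_j * f_k) * diff
```

With this scaling, a one-pixel disagreement near the image center produces a residual of magnitude close to one, so the error is again measured in roughly pixel units.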

Up vector selection. As mentioned above, there exists a global ambiguity in the pose of the 
3D cameras computed by the above methods. While this may not appear to matter, people 
prefer that the final stitched image is “upright” rather than twisted or tilted. More concretely, 
people are used to seeing photographs displayed so that the vertical (gravity) axis points 
straight up in the image. Consider how you usually shoot photographs: while you may pan 
and tilt the camera any which way, you usually keep the horizontal edge of your camera (its 
x-axis) parallel to the ground plane (perpendicular to the world gravity direction). 

Mathematically, this constraint on the rotation matrices can be expressed as follows. Re- 
call from Equation (9.26) that the 3D to 2D projection is given by 

x lk ~ K k R k Xi. (9.32) 

We wish to post-multiply each rotation matrix R k by a global rotation R g such that the pro- 
jection of the global y- axis, j = (0, 1, 0) is perpendicular to the image x-axis, % = (1, 0, O ). 10 

This constraint can be written as 

i T R k R g y = 0 (9.33) 

(note that the scaling by the calibration matrix is irrelevant here). This is equivalent to re- 
quiring that the first row of R k , r k0 = i T R k be perpendicular to the second column of R g , 
r g i = R g j- This set of constraints (one per input image) can be written as a least squares 
problem, 

r g i = argmin^^(r T rfco) 2 = argmin r T 

k L k 

Thus, r g i is the smallest eigenvector of the scatter or moment matrix spanned by the indi- 
vidual camera rotation x-vectors, which should generally be of the form (c, 0, s) when the 
cameras are upright. 

10 Note that here we use the convention common in computer graphics that the vertical world axis corresponds to
y. This is a natural choice if we wish the rotation matrix associated with a “regular” image taken horizontally to be
the identity, rather than a 90° rotation around the x-axis.

To fully specify the $R_g$ global rotation, we need to specify one additional constraint. This
is related to the view selection problem discussed in Section 9.3.1. One simple heuristic is to
prefer the average z-axis of the individual rotation matrices, $\bar{k} = \sum_k \hat{k}^T R_k$, to be close to
the world z-axis, $r_{g2} = R_g \hat{k}$. We can therefore compute the full rotation matrix $R_g$ in three
steps:

1. $r_{g1} = \text{min eigenvector} \left( \sum_k r_{k0} r_{k0}^T \right)$;

2. $r_{g0} = \mathcal{N}\left( \left( \sum_k r_{k2} \right) \times r_{g1} \right)$;

3. $r_{g2} = r_{g0} \times r_{g1}$,

where $\mathcal{N}(v) = v / \|v\|$ normalizes a vector $v$.
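
The three steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names are hypothetical, each rotation is assumed to be a 3×3 row-major nested list, the smallest eigenvector is found by power iteration on tr(M)I − M (a simple stand-in for a proper symmetric eigensolver), and the sign of the up vector is resolved by making it agree with the mean camera y-axis, which is an added assumption. One caveat: with the operand order written in step 2, step 3 yields a world z-axis that points opposite to the summed camera z-axes, so the sketch takes the cross product in the other order; treat that as an assumption about the intended sign.

```python
import math

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def smallest_eigenvector(M, iters=500):
    """Smallest eigenvector of a symmetric positive semi-definite 3x3 matrix.

    Power iteration on B = trace(M) * I - M, whose dominant eigenvector is
    the eigenvector of M with the smallest eigenvalue (an assumed stand-in
    for a library eigensolver).
    """
    t = M[0][0] + M[1][1] + M[2][2]
    B = [[(t if i == j else 0.0) - M[i][j] for j in range(3)] for i in range(3)]
    v = normalize([1.0, 1.0, 1.0])
    for _ in range(iters):
        v = normalize([sum(B[i][j] * v[j] for j in range(3)) for i in range(3)])
    return v

def global_up_rotation(Rs):
    """Return R_g (3x3 nested list) whose columns are r_g0, r_g1, r_g2."""
    # Scatter matrix of the camera x-axes r_k0 (first row of each R_k).
    M = [[sum(R[0][i] * R[0][j] for R in Rs) for j in range(3)] for i in range(3)]
    r_g1 = smallest_eigenvector(M)
    # Assumption: pick the eigenvector sign that agrees with the mean camera y-axis.
    y_sum = [sum(R[1][i] for R in Rs) for i in range(3)]
    if sum(a * b for a, b in zip(r_g1, y_sum)) < 0:
        r_g1 = [-c for c in r_g1]
    # Sum of the camera z-axes r_k2 (third row of each R_k).
    z_sum = [sum(R[2][i] for R in Rs) for i in range(3)]
    r_g0 = normalize(cross(r_g1, z_sum))  # operand order chosen so r_g2 follows +z_sum
    r_g2 = cross(r_g0, r_g1)
    return [[r_g0[i], r_g1[i], r_g2[i]] for i in range(3)]
```

Post-multiplying every camera rotation by the returned matrix should leave the second entry of each first row (the vertical component of the camera x-axis) at zero, which is exactly constraint (9.33).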

9.2.2 Parallax removal 

Once we have optimized the global orientations and focal lengths of our cameras, we may find 
that the images are still not perfectly aligned, i.e., the resulting stitched image looks blurry 
or ghosted in some places. This can be caused by a variety of factors, including unmodeled 
radial distortion, 3D parallax (failure to rotate the camera around its optical center), small 
scene motions such as waving tree branches, and large-scale scene motions such as people 
moving in and out of pictures. 

Each of these problems can be treated with a different approach. Radial distortion can be 
estimated (potentially ahead of time) using one of the techniques discussed in Section 2.1.6. 
For example, the plumb-line method (Brown 1971; Kang 2001; El-Melegy and Farag 2003) 
adjusts radial distortion parameters until slightly curved lines become straight, while mosaic- 
based approaches adjust them until mis-registration is reduced in image overlap areas (Stein 
1997; Sawhney and Kumar 1999). 

3D parallax can be handled by doing a full 3D bundle adjustment, i.e., by replacing the 
projection equation (9.26) used in Equation (9.29) with Equation (2.68), which models cam- 
era translations. The 3D positions of the matched feature points and cameras can then be si- 
multaneously recovered, although this can be significantly more expensive than parallax-free 
image registration. Once the 3D structure has been recovered, the scene could (in theory) be 
projected to a single (central) viewpoint that contains no parallax. However, in order to do 
this, dense stereo correspondence needs to be performed (Section 11.3) (Li, Shum, Tang et al. 
2004; Zheng, Kang, Cohen et al. 2007), which may not be possible if the images contain only 
partial overlap. In that case, it may be necessary to correct for parallax only in the overlap 
areas, which can be accomplished using a multi-perspective plane sweep (MPPS) algorithm 
(Kang, Szeliski, and Uyttendaele 2004; Uyttendaele, Criminisi, Kang et al. 2004). 

When the motion in the scene is very large, i.e., when objects appear and disappear com- 
pletely, a sensible solution is to simply select pixels from only one image at a time as the 
source for the final composite (Milgram 1977; Davis 1998; Agarwala, Dontcheva, Agrawala
et al. 2004), as discussed in Section 9.3.2. However, when the motion is reasonably small (on
the order of a few pixels), general 2D motion estimation (optical flow) can be used to perform 
an appropriate correction before blending using a process called local alignment (Shum and 
Szeliski 2000; Kang, Uyttendaele, Winder et al. 2003). This same process can also be used 
to compensate for radial distortion and 3D parallax, although it uses a weaker motion model 
than explicitly modeling the source of error and may, therefore, fail more often or introduce 
unwanted distortions. 

The local alignment technique introduced by Shum and Szeliski (2000) starts with the
global bundle adjustment (9.31) used to optimize the camera poses. Once these have been
estimated, the desired location of a 3D point $x_i$ can be estimated as the average of the
back-projected 3D locations,

$\bar{x}_i \sim \sum_j c_{ij} \tilde{x}_i(\hat{x}_{ij}; R_j, f_j) \Big/ \sum_j c_{ij}$, (9.35)

which can be projected into each image j to obtain a target location $\bar{x}_{ij}$. The difference
between the target locations $\bar{x}_{ij}$ and the original features $x_{ij}$ provides a set of local motion
estimates

$u_{ij} = \bar{x}_{ij} - x_{ij}$, (9.36)

which can be interpolated to form a dense correction field $u_j(x_j)$. In their system, Shum and
Szeliski (2000) use an inverse warping algorithm where the sparse $-u_{ij}$ values are placed at
the new target locations $\bar{x}_{ij}$, interpolated using bilinear kernel functions (Nielson 1993) and
then added to the original pixel coordinates when computing the warped (corrected) image.
In order to get a reasonably dense set of features to interpolate, Shum and Szeliski (2000)
place a feature point at the center of each patch (the patch size controls the smoothness in
the local alignment stage), rather than relying on features extracted using an interest operator
(Figure 9.10).
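
The sparse part of this procedure (averaging the back-projected rays of one feature and measuring how far each observation is from the re-projected average) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the camera model is assumed to be a pure rotation with calibration K = diag(f, f, 1) and the principal point at the origin, and the optional per-observation confidences are an assumed weighting. The dense interpolation of the resulting vectors is not shown.

```python
import math

def back_project(x, R, f):
    """Pixel (x, y) -> unit ray in world coordinates, i.e. R^T (x, y, f) normalized."""
    v = (x[0], x[1], f)
    ray = [sum(R[r][c] * v[r] for r in range(3)) for c in range(3)]
    n = math.sqrt(sum(t * t for t in ray))
    return [t / n for t in ray]

def project(p, R, f):
    """World-space ray p -> pixel (x, y) in the camera with rotation R, focal length f."""
    q = [sum(R[r][c] * p[c] for c in range(3)) for r in range(3)]
    return (f * q[0] / q[2], f * q[1] / q[2])

def local_motion_estimates(obs, Rs, fs, conf=None):
    """Return (targets, u) for one 3D point observed once in each image.

    obs[j] is the feature location in image j; targets[j] is the projection
    of the averaged back-projected ray; u[j] = targets[j] - obs[j].
    """
    conf = conf or [1.0] * len(obs)
    rays = [back_project(x, R, f) for x, R, f in zip(obs, Rs, fs)]
    total = sum(conf)
    mean = [sum(c * r[i] for c, r in zip(conf, rays)) / total for i in range(3)]
    targets = [project(mean, R, f) for R, f in zip(Rs, fs)]
    u = [(t[0] - x[0], t[1] - x[1]) for t, x in zip(targets, obs)]
    return targets, u
```

When the observations are perfectly consistent with a rotating camera, all the vectors in `u` vanish; any parallax or mis-registration shows up as non-zero corrections that pull each observation towards the common target.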

An alternative approach to motion-based de-ghosting was proposed by Kang, Uytten- 
daele, Winder et al. (2003), who estimate dense optical flow between each input image and a 
central reference image. The accuracy of the flow vector is checked using a photo-consistency 
measure before a given warped pixel is considered valid and is used to compute a high dy- 
namic range radiance estimate, which is the goal of their overall algorithm. The requirement 
for a reference image makes their approach less applicable to general image mosaicing, al- 
though an extension to this case could certainly be envisaged. 

9.2.3 Recognizing panoramas 

Figure 9.10 Deghosting a mosaic with motion parallax (Shum and Szeliski 2000) © 2000
IEEE: (a) composite with parallax; (b) after a single deghosting step (patch size 32); (c) after
multiple steps (sizes 32, 16 and 8).

The final piece needed to perform fully automated image stitching is a technique to recognize
which images actually go together, which Brown and Lowe (2007) call recognizing panoramas.
If the user takes images in sequence so that each image overlaps its predecessor and
also specifies the first and last images to be stitched, bundle adjustment combined with the 
process of topology inference can be used to automatically assemble a panorama (Sawhney 
and Kumar 1999). However, users often jump around when taking panoramas, e.g., they 
may start a new row on top of a previous one, jump back to take a repeat shot, or create 
360° panoramas where end-to-end overlaps need to be discovered. Furthermore, the ability 
to discover multiple panoramas taken by a user over an extended period of time can be a big 
convenience. 

To recognize panoramas, Brown and Lowe (2007) first find all pairwise image overlaps
using a feature-based method and then find connected components in the overlap graph to 
“recognize” individual panoramas (Figure 9.11). The feature-based matching stage first ex- 
tracts scale invariant feature transform (SIFT) feature locations and feature descriptors (Lowe 
2004) from all the input images and places them in an indexing structure, as described in Sec- 
tion 4.1.3. For each image pair under consideration, the nearest matching neighbor is found 
for each feature in the first image, using the indexing structure to rapidly find candidates and 
then comparing feature descriptors to find the best match. RANSAC is used to find a set of in- 
lier matches; pairs of matches are used to hypothesize similarity motion models that are then 
used to count the number of inliers. (A more recent RANSAC algorithm tailored specifically 
for rotational panoramas is described by Brown, Hartley, and Nistér (2007).)
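
The grouping step (connected components in the overlap graph) can be illustrated with a small union-find sketch. The function name, the dictionary of per-pair RANSAC inlier counts, and the `min_inliers` threshold are all hypothetical; they stand in for whatever pairwise verification a real system uses.

```python
def recognize_panoramas(num_images, pair_inliers, min_inliers=20):
    """Group image indices into panoramas.

    pair_inliers maps (i, j) image pairs to their number of inlier matches;
    pairs with fewer than min_inliers are treated as non-overlapping.
    """
    parent = list(range(num_images))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for (a, b), inliers in pair_inliers.items():
        if inliers >= min_inliers:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[rb] = ra

    groups = {}
    for i in range(num_images):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

For example, with pairs (0, 1), (1, 2), and (4, 5) verified and pair (2, 3) rejected for having too few inliers, six images split into the groups [0, 1, 2], [3], and [4, 5]. Image 3 forms a singleton "panorama".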

In practice, the most difficult part of getting a fully automated stitching algorithm to 
work is deciding which pairs of images actually correspond to the same parts of the scene. 
Repeated structures such as windows (Figure 9.12) can lead to false matches when using 
a feature-based approach. One way to mitigate this problem is to perform a direct pixel- 
based comparison between the registered images to determine if they actually are different 
views of the same scene. Unfortunately, this heuristic may fail if there are moving objects 
in the scene (Figure 9.13). While there is no magic bullet for this problem, short of full 
scene understanding, further improvements can likely be made by applying domain-specific
heuristics, such as priors on typical camera motions as well as machine learning techniques
applied to the problem of match validation.

Figure 9.11 Recognizing panoramas (Brown, Szeliski, and Winder 2005), figures courtesy
of Matthew Brown: (a) input images with pairwise matches; (b) images grouped into
connected components (panoramas); (c) individual panoramas registered and blended into
stitched composites.

Figure 9.12 Matching errors (Brown, Szeliski, and Winder 2004): accidental matching of
several features can lead to matches between pairs of images that do not actually overlap.

Figure 9.13 Validation of image matches by direct pixel error comparison can fail when the
scene contains moving objects (Uyttendaele, Eden, and Szeliski 2001) © 2001 IEEE.

9.2.4 Direct vs. feature-based alignment 

Given that there exist these two approaches to aligning images, which is preferable? 

Early feature-based methods would get confused in regions that were either too textured 
or not textured enough. The features would often be distributed unevenly over the images, 
thereby failing to match image pairs that should have been aligned. Furthermore, establishing 
correspondences relied on simple cross-correlation between patches surrounding the feature 
points, which did not work well when the images were rotated or had foreshortening due to 
homographies. 

Today, feature detection and matching schemes are remarkably robust and can even be 
used for known object recognition from widely separated views (Lowe 2004). Features not 
only respond to regions of high “cornerness” (Förstner 1986; Harris and Stephens 1988) but
also to “blob-like” regions (Lowe 2004), and uniform areas (Matas, Chum, Urban et al. 2004;
Tuytelaars and Van Gool 2004). Furthermore, because they operate in scale-space and use a 
dominant orientation (or orientation invariant descriptors), they can match images that differ 
in scale, orientation, and even foreshortening. Our own experience in working with feature- 
based approaches is that if the features are well distributed over the image and the descriptors 
reasonably designed for repeatability, enough correspondences to permit image stitching can 
usually be found (Brown, Szeliski, and Winder 2005). 

The biggest disadvantage of direct pixel-based alignment techniques is that they have a 
limited range of convergence. Even though they can be used in a hierarchical (coarse-to- 
fine) estimation framework, in practice it is hard to use more than two or three levels of a 
pyramid before important details start to be blurred away. 11 For matching sequential frames
in a video, direct approaches can usually be made to work. However, for matching partially 
overlapping images in photo-based panoramas or for image collections where the contrast or 
content varies too much, they fail too often to be useful and feature-based approaches are 
therefore preferred. 

9.3 Compositing 

Once we have registered all of the input images with respect to each other, we need to decide 
how to produce the final stitched mosaic image. This involves selecting a final compositing 
surface (flat, cylindrical, spherical, etc.) and view (reference image). It also involves selecting
which pixels contribute to the final composite and how to optimally blend these pixels to
minimize visible seams, blur, and ghosting.

11 Fourier-based correlation (Szeliski 1996; Szeliski and Shum 1997) can extend this range but requires cylindrical
images or motion prediction to be useful.

In this section, we review techniques that address these problems, namely compositing 
surface parameterization, pixel and seam selection, blending, and exposure compensation. 
My emphasis is on fully automated approaches to the problem. Since the creation of high- 
quality panoramas and composites is as much an artistic endeavor as a computational one, 
various interactive tools have been developed to assist this process (Agarwala, Dontcheva, 
Agrawala et al. 2004; Li, Sun, Tang et al. 2004; Rother, Kolmogorov, and Blake 2004). 
Some of these are covered in more detail in Section 10.4. 

9.3.1 Choosing a compositing surface 

The first choice to be made is how to represent the final image. If only a few images are 
stitched together, a natural approach is to select one of the images as the reference and to 
then warp all of the other images into its reference coordinate system. The resulting composite
is sometimes called a flat panorama, since the projection onto the final surface is still
a perspective projection, and hence straight lines remain straight (which is often a desirable 
attribute). 12 

For larger fields of view, however, we cannot maintain a flat representation without ex- 
cessively stretching pixels near the border of the image. (In practice, flat panoramas start 
to look severely distorted once the field of view exceeds 90° or so.) The usual choice for 
compositing larger panoramas is to use a cylindrical (Chen 1995; Szeliski 1996) or spherical 
(Szeliski and Shum 1997) projection, as described in Section 9.1.6. In fact, any surface used 
for environment mapping in computer graphics can be used, including a cube map, which
represents the full viewing sphere with the six square faces of a cube (Greene 1986; Szeliski 
and Shum 1997). Cartographers have also developed a number of alternative methods for 
representing the globe (Bugayevskiy and Snyder 1995). 

The choice of parameterization is somewhat application dependent and involves a trade- 
off between keeping the local appearance undistorted (e.g., keeping straight lines straight) 
and providing a reasonably uniform sampling of the environment. Automatically making 
this selection and smoothly transitioning between representations based on the extent of the 
panorama is an active area of current research (Kopf, Uyttendaele, Deussen et al. 2007). 

An interesting recent development in panoramic photography has been the use of stereo- 
graphic projections looking down at the ground (in an outdoor scene) to create “little planet” 
renderings. 13 

12 Recently, some techniques have been developed to straighten curved lines in cylindrical and spherical
panoramas (Carroll, Agrawala, and Agarwala 2009; Kopf, Lischinski, Deussen et al. 2009).

13 These are inspired by The Little Prince by Antoine de Saint-Exupéry. Go to http://www.flickr.com and search
for “little planet projection”.

View selection. Once we have chosen the output parameterization, we still need to determine
which part of the scene will be centered in the final view. As mentioned above, for a flat
composite, we can choose one of the images as a reference. Often, a reasonable choice is the 
one that is geometrically most central. For example, for rotational panoramas represented as 
a collection of 3D rotation matrices, we can choose the image whose z-axis is closest to the 
average z-axis (assuming a reasonable field of view). Alternatively, we can use the average 
z-axis (or quaternion, but this is trickier) to define the reference rotation matrix. 

For larger, e.g., cylindrical or spherical, panoramas, we can use the same heuristic if a 
subset of the viewing sphere has been imaged. In the case of full 360° panoramas, a better 
choice might be to choose the middle image from the sequence of inputs, or sometimes the 
first image, assuming this contains the object of greatest interest. In all of these cases, having 
the user control the final view is often highly desirable. If the “up vector” computation de- 
scribed in Section 9.2.1 is working correctly, this can be as simple as panning over the image 
or setting a vertical “center line” for the final panorama. 

Coordinate transformations. After selecting the parameterization and reference view, we 
still need to compute the mappings between the input and output pixel coordinates.

If the final compositing surface is flat (e.g., a single plane or the face of a cube map) 
and the input images have no radial distortion, the coordinate transformation is the simple 
homography described by (9.5). This kind of warping can be performed in graphics hardware 
by appropriately setting texture mapping coordinates and rendering a single quadrilateral. 

If the final composite surface has some other analytic form (e.g., cylindrical or spherical), 
we need to convert every pixel in the final panorama into a viewing ray (3D point) and then 
map it back into each image according to the projection (and optionally radial distortion) 
equations. This process can be made more efficient by precomputing some lookup tables, 
e.g., the partial trigonometric functions needed to map cylindrical or spherical coordinates to 
3D coordinates or the radial distortion field at each pixel. It is also possible to accelerate this 
process by computing exact pixel mappings on a coarser grid and then interpolating these 
values. 
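
As an illustration of this inverse mapping for a cylindrical compositing surface, the sketch below converts each output pixel into a viewing ray and projects it into one input image, precomputing the per-column trigonometric terms once. It is a sketch under assumptions rather than a definitive implementation: the function name is hypothetical, the cylinder is assumed to have radius `s` with ray (sin θ, h, cos θ), the camera is assumed to be a pure rotation `R` with focal length `f` and principal point at the origin, and radial distortion is ignored.

```python
import math

def build_cylindrical_lookup(width, height, s, R, f):
    """Map each cylindrical output pixel to (x, y) in one input image.

    Output coordinates are centered, so theta = (col - cx) / s and
    h = (row - cy) / s. Returns a height x width table of source
    coordinates, or None where the ray points behind the camera.
    """
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    # Per-column trigonometric terms, computed once and reused for every row.
    sin_t = [math.sin((col - cx) / s) for col in range(width)]
    cos_t = [math.cos((col - cx) / s) for col in range(width)]
    table = []
    for row in range(height):
        h = (row - cy) / s
        out_row = []
        for col in range(width):
            p = (sin_t[col], h, cos_t[col])  # viewing ray (3D point)
            q = [sum(R[r][c] * p[c] for c in range(3)) for r in range(3)]
            if q[2] <= 0.0:
                out_row.append(None)  # behind the camera
            else:
                out_row.append((f * q[0] / q[2], f * q[1] / q[2]))
        table.append(out_row)
    return table
```

The resulting table gives fractional source coordinates, so the input image still has to be resampled (for example, bilinearly) at those locations. Evaluating the table on a coarser grid and interpolating it is a straightforward variation.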

When the final compositing surface is a texture-mapped polyhedron, a slightly more so- 
phisticated algorithm must be used. Not only do the 3D and texture map coordinates have to 
be properly handled, but a small amount of overdraw outside the triangle footprints in the tex- 
ture map is necessary, to ensure that the texture pixels being interpolated during 3D rendering 
have valid values (Szeliski and Shum 1997). 

Sampling issues. While the above computations can yield the correct (fractional) pixel
addresses in each input image, we still need to pay attention to sampling issues. For example,
if the final panorama has a lower resolution than the input images, pre-filtering the input
images is necessary to avoid aliasing. These issues have been extensively studied in both the 
image processing and computer graphics communities. The basic problem is to compute the 
appropriate pre-filter, which depends on the distance (and arrangement) between neighboring 
samples in a source image. As discussed in Sections 3.5.2 and 3.6.1, various approximate 
solutions, such as MIP mapping (Williams 1983) or elliptically weighted Gaussian averaging 
(Greene and Heckbert 1986) have been developed in the graphics community. For highest 
visual quality, a higher order (e.g., cubic) interpolator combined with a spatially adaptive
pre-filter may be necessary (Wang, Kang, Szeliski et al. 2001). Under certain conditions, it may
also be possible to produce images with a higher resolution than the input images using the 
process of super-resolution (Section 10.3). 

9.3.2 Pixel selection and weighting (de-ghosting) 

Once the source pixels have been mapped onto the final composite surface, we must still 
decide how to blend them in order to create an attractive-looking panorama. If all of the 
images are in perfect registration and identically exposed, this is an easy problem, i.e., any 
pixel or combination will do. However, for real images, visible seams (due to exposure 
differences), blurring (due to mis-registration), or ghosting (due to moving objects) can occur. 

Creating clean, pleasing-looking panoramas involves both deciding which pixels to use 
and how to weight or blend them. The distinction between these two stages is a little fluid, 
since per-pixel weighting can be thought of as a combination of selection and blending. In 
this section, we discuss spatially varying weighting, pixel selection (seam placement), and 
then more sophisticated blending. 

Figure 9.14 Final composites computed by a variety of algorithms (Szeliski 2006a): (a)
average, (b) median, (c) feathered average, (d) p-norm p = 10, (e) Voronoi, (f) weighted
ROD vertex cover with feathering, (g) graph cut seams with Poisson blending and (h) with
pyramid blending.

Feathering and center-weighting. The simplest way to create a final composite is to simply
take an average value at each pixel,

$C(x) = \sum_k w_k(x) \tilde{I}_k(x) \Big/ \sum_k w_k(x)$, (9.37)

where $\tilde{I}_k(x)$ are the warped (re-sampled) images and $w_k(x)$ is 1 at valid pixels and 0 elsewhere.
On computer graphics hardware, this kind of summation can be performed in an
accumulation buffer (using the A channel as the weight).

Simple averaging usually does not work very well, since exposure differences, mis-registrations,
and scene movement are all very visible (Figure 9.14a). If rapidly moving
objects are the only problem, a median filter (which is a kind of pixel selection operator)
can often be used to remove them (Figure 9.14b) (Irani and Anandan 1998). Conversely,
center-weighting (discussed below) and minimum likelihood selection (Agarwala, Dontcheva,
Agrawala et al. 2004) can sometimes be used to retain multiple copies of a moving object
(Figure 9.17).
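
A minimal sketch of the two simplest compositing rules, the weighted average of Equation (9.37) and the per-pixel median, is given below. The function names are hypothetical, and images are assumed to be already warped, single-channel, and stored as nested lists with one weight map per image.

```python
from statistics import median

def average_composite(images, weights):
    """C(x) = sum_k w_k(x) I_k(x) / sum_k w_k(x); None where no image is valid."""
    rows, cols = len(images[0]), len(images[0][0])
    out = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = sum(w[r][c] for w in weights)
            if total > 0:
                out[r][c] = sum(w[r][c] * img[r][c]
                                for w, img in zip(weights, images)) / total
    return out

def median_composite(images, weights):
    """Per-pixel median over the images that are valid (weight > 0) at that pixel."""
    rows, cols = len(images[0]), len(images[0][0])
    out = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            vals = [img[r][c] for w, img in zip(weights, images) if w[r][c] > 0]
            if vals:
                out[r][c] = median(vals)
    return out
```

With three overlapping images, a transient object that appears in only one of them pulls the average towards its value but leaves the median untouched, which is why the median can remove rapidly moving objects.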

A better approach to averaging is to weight pixels near the center of the image more 
heavily and to down-weight pixels near the edges. When an image has some cutout regions, 
down-weighting pixels near the edges of both cutouts and the image is preferable. This can 
be done by computing a distance map or grassfire transform,

$w_k(x) = \arg\min_y \{ \|y\| \mid \tilde{I}_k(x + y) \text{ is invalid} \}$, (9.38)

where each valid pixel is tagged with its Euclidean distance to the nearest invalid pixel (Sec- 
tion 3.3.3). The Euclidean distance map can be efficiently computed using a two-pass raster 
algorithm (Danielsson 1980; Borgefors 1986). 
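
To make the two-pass raster idea concrete, here is a small sketch of a distance map computation. Note the simplification: it uses the city-block (L1) metric, for which a forward and a backward pass are exact, rather than the Euclidean metric discussed above. The function name is hypothetical, and pixels outside the image are assumed to be invalid, so border pixels receive a distance of 1.

```python
def grassfire(valid):
    """City-block distance from each valid pixel to the nearest invalid pixel.

    valid is a nested list of truthy/falsy values; invalid pixels get 0.
    """
    rows, cols = len(valid), len(valid[0])
    big = rows + cols  # larger than any possible city-block distance
    d = [[big if valid[r][c] else 0 for c in range(cols)] for r in range(rows)]
    # Forward pass: propagate distances from the top and left neighbors.
    for r in range(rows):
        for c in range(cols):
            if d[r][c]:
                up = d[r - 1][c] if r > 0 else 0
                left = d[r][c - 1] if c > 0 else 0
                d[r][c] = min(d[r][c], up + 1, left + 1)
    # Backward pass: propagate distances from the bottom and right neighbors.
    for r in reversed(range(rows)):
        for c in reversed(range(cols)):
            if d[r][c]:
                down = d[r + 1][c] if r < rows - 1 else 0
                right = d[r][c + 1] if c < cols - 1 else 0
                d[r][c] = min(d[r][c], down + 1, right + 1)
    return d
```

The returned map can be used directly as the per-image weight: it is largest near the center of the valid region and falls off towards image borders and cutouts.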

Weighted averaging with a distance map is often called feathering (Szeliski and Shum 
1997; Chen and Klette 1999; Uyttendaele, Eden, and Szeliski 2001) and does a reasonable job 
of blending over exposure differences. However, blurring and ghosting can still be problems 
(Figure 9.14c). Note that weighted averaging is not the same as compositing the individual 
images with the classic over operation (Porter and Duff 1984; Blinn 1994a), even when using 
the weight values (normalized to sum up to one) as alpha (translucency) channels. This is 
because the over operation attenuates the values from more distant surfaces and, hence, is not 
equivalent to a direct sum. 

One way to improve feathering is to raise the distance map values to some large power, 
i.e., to use $w_k^p(x)$ in Equation (9.37). The weighted averages then become dominated by
the larger values, i.e., they act somewhat like a p-norm. The resulting composite can often 
provide a reasonable tradeoff between visible exposure differences and blur (Figure 9.14d). 

In the limit as $p \to \infty$, only the pixel with the maximum weight is selected,

$C(x) = \tilde{I}_{l(x)}(x)$, (9.39)

where

$l = \arg\max_k w_k(x)$ (9.40)

is the label assignment or pixel selection function that selects which image to use at each 
pixel. This hard pixel selection process produces a visibility mask-sensitive variant of the familiar
Voronoi diagram, which assigns each pixel to the nearest image center in the set (Wood,
Finkelstein, Hughes et al. 1997; Peleg, Rousso, Rav-Acha et al. 2000). The resulting composite,
while useful for artistic guidance and in high-overlap panoramas (manifold mosaics),
tends to have very hard edges with noticeable seams when the exposures vary (Figure 9.14e).
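
The relationship between raising the weights to a power p and hard pixel selection can be checked numerically for a single pixel. In this sketch the function names are hypothetical; `values` holds the candidate pixel values from the overlapping images and `weights` their distance-map weights at that pixel.

```python
def pnorm_blend(values, weights, p):
    """Weighted average with the weights raised to the power p."""
    ws = [w ** p for w in weights]
    return sum(w * v for w, v in zip(ws, values)) / sum(ws)

def hard_select(values, weights):
    """Label l = argmax_k w_k and the corresponding pixel value."""
    label = max(range(len(weights)), key=weights.__getitem__)
    return label, values[label]
```

For two candidates with values 10 and 20 and weights 3 and 5, p = 1 gives the ordinary feathered value 16.25, p = 10 already gives a value within 0.1 of 20, and the hard selection returns label 1 with value 20.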

Xiong and Turkowski (1998) use this Voronoi idea (local maximum of the grassfire transform)
to select seams for Laplacian pyramid blending (which is discussed below). However,
since the seam selection is performed sequentially as new images are added in, some artifacts
can occur.

Figure 9.15 Computation of regions of difference (RODs) (Uyttendaele, Eden, and Szeliski
2001) © 2001 IEEE: (a) three overlapping images with a moving face; (b) corresponding
RODs; (c) graph of coincident RODs.

Optimal seam selection. Computing the Voronoi diagram is one way to select the seams 
between regions where different images contribute to the final composite. However, Voronoi 
images totally ignore the local image structure underlying the seam. 

A better approach is to place the seams in regions where the images agree, so that tran- 
sitions from one source to another are not visible. In this way, the algorithm avoids “cutting 
through” moving objects where a seam would look unnatural (Davis 1998). For a pair of 
images, this process can be formulated as a simple dynamic program starting from one edge 
of the overlap region and ending at the other (Milgram 1975, 1977; Davis 1998; Efros and 
Freeman 2001). 
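
For a pair of images, the dynamic program can be sketched as follows. This is a minimal illustration with assumed details: the function name is hypothetical, the two inputs are the single-channel overlap regions of the registered images, the seam runs from the top row to the bottom row, it may move by at most one column per row, and the per-pixel cost is the absolute intensity difference.

```python
def find_seam(img_a, img_b):
    """Return, for each row, the column of a minimum-cost top-to-bottom seam."""
    rows, cols = len(img_a), len(img_a[0])
    cost = [[abs(img_a[r][c] - img_b[r][c]) for c in range(cols)]
            for r in range(rows)]
    # Accumulate the cheapest path cost reaching each pixel from the top row.
    acc = [cost[0][:]]
    for r in range(1, rows):
        row = []
        for c in range(cols):
            lo, hi = max(c - 1, 0), min(c + 1, cols - 1)
            row.append(cost[r][c] + min(acc[r - 1][lo:hi + 1]))
        acc.append(row)
    # Backtrack from the cheapest pixel in the bottom row.
    c = min(range(cols), key=acc[-1].__getitem__)
    seam = [c]
    for r in range(rows - 1, 0, -1):
        lo, hi = max(c - 1, 0), min(c + 1, cols - 1)
        c = min(range(lo, hi + 1), key=acc[r - 1].__getitem__)
        seam.append(c)
    seam.reverse()
    return seam
```

Pixels to the left of the seam would then be taken from one image and pixels to the right from the other, so the transition falls where the two images agree.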

When multiple images are being composited, the dynamic program idea does not readily 
generalize. (For square texture tiles being composited sequentially, Efros and Freeman (2001) 
run a dynamic program along each of the four tile sides.) 

Figure 9.16 Photomontage (Agarwala, Dontcheva, Agrawala et al. 2004) © 2004 ACM.
From a set of five source images (of which four are shown on the left), Photomontage quickly
creates a composite family portrait in which everyone is smiling and looking at the camera
(right). Users simply flip through the stack and coarsely draw strokes using the designated
source image objective over the people they wish to add to the composite. The user-applied
strokes and computed regions (middle) are color-coded by the borders of the source images
on the left.

To overcome this problem, Uyttendaele, Eden, and Szeliski (2001) observed that, for
well-registered images, moving objects produce the most visible artifacts, namely translucent-looking
ghosts. Their system therefore decides which objects to keep and which ones
to erase. First, the algorithm compares all overlapping input image pairs to determine regions
of difference (RODs) where the images disagree. Next, a graph is constructed with the
RODs as vertices and edges representing ROD pairs that overlap in the final composite
(Figure 9.15). Since the presence of an edge indicates an area of disagreement, vertices (regions)
must be removed from the final composite until no edge spans a pair of remaining vertices.
The smallest such set can be computed using a vertex cover algorithm. Since several such
covers may exist, a weighted vertex cover is used instead, where the vertex weights are computed
by summing the feather weights in the ROD (Uyttendaele, Eden, and Szeliski 2001).
The algorithm therefore prefers removing regions that are near the edge of the image, which
reduces the likelihood that partially visible objects will appear in the final composite. (It is
also possible to infer which object in a region of difference is the foreground object by the
“edginess” (pixel differences) across the ROD boundary, which should be higher when an
object is present (Herley 2005).) Once the desired excess regions of difference have been
removed, the final composite can be created by feathering (Figure 9.14f).
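
The weighted vertex cover step on the ROD graph can be illustrated with an exhaustive search, which is practical only because the number of RODs in a typical overlap is small. This brute-force search is an assumed stand-in chosen for clarity, not necessarily the solver used in the cited system, and the function name is hypothetical.

```python
from itertools import combinations

def min_weight_vertex_cover(weights, edges):
    """Return (vertices_to_remove, total_weight) for the cheapest vertex cover.

    weights[v] is the summed feather weight of ROD v; edges lists pairs of
    RODs that overlap (disagree) in the final composite.
    """
    n = len(weights)
    best, best_cost = [], float("inf")
    for k in range(n + 1):
        for subset in combinations(range(n), k):
            chosen = set(subset)
            # A cover must touch every edge, so no conflicting pair survives.
            if all(a in chosen or b in chosen for a, b in edges):
                cost = sum(weights[v] for v in chosen)
                if cost < best_cost:
                    best, best_cost = sorted(chosen), cost
    return best, best_cost
```

Because the weights are sums of feather weights, a ROD near an image border is cheap to remove, so the search tends to keep the copy of a moving object that lies closest to the center of its source image.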

A different approach to pixel selection and seam placement is described by Agarwala, 
Dontcheva, Agrawala et al. (2004). Their system computes the label assignment that opti- 
mizes the sum of two objective functions. The first is a per-pixel image objective that deter- 
mines which pixels are likely to produce good composites, 

$C_D = \sum_x D(x, l(x))$, (9.41)

where $D(x, l)$ is the data penalty associated with choosing image $l$ at pixel $x$. In their system,
users can select which pixels to use by “painting” over an image with the desired object or 
appearance, which sets D(x,l) to a large value for all labels l other than the one selected 
by the user (Figure 9.16). Alternatively, automated selection criteria can be used, such as 
maximum likelihood, which prefers pixels that occur repeatedly in the background (for object
removal), or minimum likelihood for objects that occur infrequently, i.e., for moving object 
retention. Using a more traditional center-weighted data term tends to favor objects that are 
centered in the input images (Figure 9.17). 

Figure 9.17 Set of five photos tracking a snowboarder’s jump stitched together into a seamless
composite. Because the algorithm prefers pixels near the center of the image, multiple
copies of the boarder are retained.

The second term is a seam objective that penalizes differences in labelings between adjacent
images,

$C_S = \sum_{(x,y) \in \mathcal{N}} S(x, y, l(x), l(y))$, (9.42)

where $S(x, y, l_x, l_y)$ is the image-dependent interaction penalty or seam cost of placing a
seam between pixels $x$ and $y$, and $\mathcal{N}$ is the set of $\mathcal{N}_4$ neighboring pixels. For example,
the simple color-based seam penalty used in (Kwatra, Schödl, Essa et al. 2003; Agarwala,
Dontcheva, Agrawala et al. 2004) can be written as

$S(x, y, l_x, l_y) = \|\tilde{I}_{l_x}(x) - \tilde{I}_{l_y}(x)\| + \|\tilde{I}_{l_x}(y) - \tilde{I}_{l_y}(y)\|$. (9.43)

More sophisticated seam penalties can also look at image gradients or the presence of image 
edges (Agarwala, Dontcheva, Agrawala et al. 2004). Seam penalties are widely used in other 
computer vision applications such as stereo matching (Boykov, Veksler, and Zabih 2001) to 
give the labeling function its coherence or smoothness. An alternative approach, which places 
seams along strong consistent edges in overlapping images using a watershed computation is 
described by Soille (2006). 

The sum of these two objective functions gives rise to a Markov random field (MRF), 
for which good optimization algorithms are described in Sections 3.7.2 and 5.5 and Ap- 
pendix B.5. For label computations of this kind, the a-expansion algorithm developed by 
Boykov, Veksler, and Zabih (2001) works particularly well (Szeliski, Zabih, Scharstein et al. 
2008). 
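
To make the two objectives concrete, the following sketch evaluates the data and seam costs of a given labeling. It only scores a labeling and does not perform the optimization itself. The function name and the `data_cost` callback are hypothetical, images are assumed to be single-channel nested lists, neighbors are taken from the 4-neighborhood, and the seam cost is assumed to be the simple color-difference penalty, i.e., the sum of the absolute differences between the two source images at both pixels of a neighboring pair.

```python
def labeling_energy(images, labels, data_cost):
    """Return (C_D, C_S) for a per-pixel labeling.

    labels[r][c] is the index of the source image chosen at pixel (r, c);
    data_cost(r, c, l) is the penalty for choosing image l at that pixel.
    """
    rows, cols = len(labels), len(labels[0])
    c_d = sum(data_cost(r, c, labels[r][c])
              for r in range(rows) for c in range(cols))
    c_s = 0.0
    for r in range(rows):
        for c in range(cols):
            # Visit each 4-neighbor pair once (right and down neighbors).
            for rr, cc in ((r, c + 1), (r + 1, c)):
                if rr < rows and cc < cols:
                    lx, ly = labels[r][c], labels[rr][cc]
                    if lx != ly:  # the penalty is zero when the labels agree
                        c_s += (abs(images[lx][r][c] - images[ly][r][c]) +
                                abs(images[lx][rr][cc] - images[ly][rr][cc]))
    return c_d, c_s
```

A label optimizer would search for the labeling that minimizes the sum of the two returned terms. This function is useful mainly for checking that a proposed seam placement actually lowers the total cost.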

For the result shown in Figure 9.14g, Agarwala, Dontcheva, Agrawala et al. (2004) use 
a large data penalty for invalid pixels and 0 for valid pixels. Notice how the seam placement 
algorithm avoids regions of difference, including those that border the image and that might 
result in objects being cut off. Graph cuts (Agarwala, Dontcheva, Agrawala et al. 2004) and
vertex cover (Uyttendaele, Eden, and Szeliski 2001) often produce similar looking results,
although the former is significantly slower since it optimizes over all pixels, while the latter 
is more sensitive to the thresholds used to determine regions of difference. 

9.3.3 Application: Photomontage

While image stitching is normally used to composite partially overlapping photographs, it 
can also be used to composite repeated shots of a scene taken with the aim of obtaining the 
best possible composition and appearance of each element. 

Figure 9.16 shows the Photomontage system developed by Agarwala, Dontcheva, Agrawala 
et al. (2004), where users draw strokes over a set of pre-aligned images to indicate which re- 
gions they wish to keep from each image. Once the system solves the resulting multi-label 
graph cut (9.41-9.42), the various pieces taken from each source photo are blended together 
using a variant of Poisson image blending (9.44-9.46). Their system can also be used to au- 
tomatically composite an all-focus image from a series of bracketed focus images (Hasinoff, 
Kutulakos, Durand et al. 2009) or to remove wires and other unwanted elements from sets of 
photographs. Exercise 9.10 has you implement this system and try out some of its variants. 

9.3.4 Blending 

Once the seams between images have been determined and unwanted objects removed, we 
still need to blend the images to compensate for exposure differences and other mis-alignments. 
The spatially varying weighting (feathering) previously discussed can often be used to accom- 
plish this. However, it is difficult in practice to achieve a pleasing balance between smoothing 
out low-frequency exposure variations and retaining sharp enough transitions to prevent blur- 
ring (although using a high exponent in feathering can help). 

Laplacian pyramid blending. An attractive solution to this problem is the Laplacian pyra- 
mid blending technique developed by Burt and Adelson (1983b), which we discussed in Sec- 
tion 3.5.5. Instead of using a single transition width, a frequency-adaptive width is used by 
creating a band-pass (Laplacian) pyramid and making the transition widths within each level 
a function of the level, i.e., the same width in pixels. In practice, a small number of levels, 
i.e., as few as two (Brown and Lowe 2007), may be adequate to compensate for differences 
in exposure. The result of applying this pyramid blending is shown in Figure 9.14h. 
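
To make the two-level case concrete, here is a minimal Python sketch that operates on a single scanline. It is not the full Burt and Adelson pyramid: a box filter stands in for the Gaussian low-pass, only two bands are used, and the names `box_blur` and `two_band_blend` are illustrative rather than taken from any library. The low-frequency band is mixed with a smooth weight, while the high-frequency band switches abruptly at the seam.

```python
def box_blur(signal, radius):
    """Low-pass filter a 1D signal with a box kernel, replicating the border samples."""
    n = len(signal)
    out = []
    for i in range(n):
        window = [signal[min(max(j, 0), n - 1)] for j in range(i - radius, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def two_band_blend(left, right, seam, radius=4):
    """Blend two aligned scanlines: `left` is used before `seam`, `right` from `seam` on."""
    n = len(left)
    mask = [1.0 if i < seam else 0.0 for i in range(n)]   # hard selection mask
    soft_mask = box_blur(mask, radius)                     # wide transition for low frequencies
    low_left, low_right = box_blur(left, radius), box_blur(right, radius)
    out = []
    for i in range(n):
        high_left = left[i] - low_left[i]                  # band-pass residual of each source
        high_right = right[i] - low_right[i]
        low = soft_mask[i] * low_left[i] + (1.0 - soft_mask[i]) * low_right[i]
        high = mask[i] * high_left + (1.0 - mask[i]) * high_right
        out.append(low + high)                             # collapse the two bands
    return out
```

For two constant scanlines of different brightness, the output ramps between the two levels over roughly 2 × radius + 1 pixels, while the high-frequency detail on each side of the seam is taken unmixed from a single source.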

Gradient domain blending. An alternative approach to multi-band image blending is to 
perform the operations in the gradient domain. Reconstructing images from their gradient 
fields has a long history in computer vision (Horn 1986), starting originally with work in 
brightness constancy (Horn 1974), shape from shading (Horn and Brooks 1989), and photometric 
stereo (Woodham 1981). More recently, related ideas have been used for reconstructing 
images from their edges (Elder and Goldberg 2001), removing shadows from images 
(Weiss 2001), separating reflections from a single image (Levin, Zomet, and Weiss 2004; 
Levin and Weiss 2007), and tone mapping high dynamic range images by reducing the magnitude 
of image edges (gradients) (Fattal, Lischinski, and Werman 2002). 

Figure 9.18 Poisson image editing (Perez, Gangnet, and Blake 2003) © 2003 ACM: (a) 
The dog and the two children are chosen as source images to be pasted into the destination 
swimming pool. (b) Simple pasting fails to match the colors at the boundaries, whereas (c) 
Poisson image blending masks these differences. 

Perez, Gangnet, and Blake (2003) show how gradient domain reconstruction can be used 
to do seamless object insertion in image editing applications (Figure 9.18). Rather than copy- 
ing pixels, the gradients of the new image fragment are copied instead. The actual pixel values 
for the copied area are then computed by solving a Poisson equation that locally matches the 
gradients while obeying the fixed Dirichlet (exact matching) conditions at the seam bound- 
ary. Perez, Gangnet, and Blake (2003) show that this is equivalent to computing an additive 
membrane interpolant of the mismatch between the source and destination images along the 
boundary. 14 In earlier work, Peleg (1981) also proposed adding a smooth function to enforce 
consistency along the seam curve. 

Agarwala, Dontcheva, Agrawala et al. (2004) extended this idea to a multi-source formu- 
lation, where it no longer makes sense to talk of a destination image whose exact pixel values 
must be matched at the seam. Instead, each source image contributes its own gradient field 
and the Poisson equation is solved using Neumann boundary conditions, i.e., dropping any 
equations that involve pixels outside the boundary of the image. 

14 The membrane interpolant is known to have nicer interpolation properties for arbitrary-shaped constraints than 
frequency-domain interpolants (Nielson 1993). 

Rather than solving the Poisson partial differential equations, Agarwala, Dontcheva, Agrawala 
et al. (2004) directly minimize a variational problem, 

$$\min_{C(\mathbf{x})} \left\| \nabla C(\mathbf{x}) - \nabla \tilde{I}_{l(\mathbf{x})}(\mathbf{x}) \right\|^2. \tag{9.44}$$

The discretized form of this equation is a set of gradient constraint equations 

$$C(\mathbf{x} + \hat{\imath}) - C(\mathbf{x}) = \tilde{I}_{l(\mathbf{x})}(\mathbf{x} + \hat{\imath}) - \tilde{I}_{l(\mathbf{x})}(\mathbf{x}) \quad \text{and} \tag{9.45}$$

$$C(\mathbf{x} + \hat{\jmath}) - C(\mathbf{x}) = \tilde{I}_{l(\mathbf{x})}(\mathbf{x} + \hat{\jmath}) - \tilde{I}_{l(\mathbf{x})}(\mathbf{x}), \tag{9.46}$$

where î = (1, 0) and ĵ = (0, 1) are unit vectors in the x and y directions. 15 They then solve 
the associated sparse least squares problem. Since this system of equations is only defined 
up to an additive constant, Agarwala, Dontcheva, Agrawala et al. (2004) ask the user to 
select the value of one pixel. In practice, a better choice might be to weakly bias the solution 
towards reproducing the original color values. 
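
The structure of this least squares problem is easiest to see in one dimension. The sketch below is an illustration under simplifying assumptions, not the solver used by Agarwala, Dontcheva, Agrawala et al. (2004): it blends a single scanline, takes the per-pixel labels as given, averages the two source gradients across a seam (footnote 15), and resolves the unknown additive constant with a weak bias towards the original colors. The function name and the `bias` parameter are hypothetical. In 1D the normal equations are tridiagonal, so they can be solved directly with the Thomas algorithm instead of multigrid or conjugate gradient.

```python
def gradient_domain_blend_1d(images, labels, bias=1e-3):
    """Blend one scanline by least squares on gradients, with a weak color bias.

    images: list of aligned source scanlines (lists of floats, equal length)
    labels: labels[i] is the index of the source image selected at pixel i
    """
    n = len(labels)
    # Color each pixel would have if we simply cut and pasted along the seams.
    target = [images[labels[i]][i] for i in range(n)]
    # Desired gradients; at a seam this is the average of the two source gradients.
    grad = []
    for i in range(n - 1):
        a, b = images[labels[i]], images[labels[i + 1]]
        grad.append(0.5 * ((a[i + 1] - a[i]) + (b[i + 1] - b[i])))
    # Normal equations of
    #   sum_i (C[i+1] - C[i] - grad[i])^2 + bias * sum_i (C[i] - target[i])^2,
    # a tridiagonal system whose off-diagonal entries are all -1.
    diag = [bias] * n
    rhs = [bias * t for t in target]
    for i, g in enumerate(grad):
        diag[i] += 1.0
        diag[i + 1] += 1.0
        rhs[i] -= g
        rhs[i + 1] += g
    # Thomas algorithm: forward elimination followed by back substitution.
    cp = [0.0] * n
    dp = [0.0] * n
    for i in range(n):
        denom = diag[i] + (cp[i - 1] if i > 0 else 0.0)
        prev = dp[i - 1] if i > 0 else 0.0
        cp[i] = -1.0 / denom
        dp[i] = (rhs[i] + prev) / denom
    out = [0.0] * n
    out[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        out[i] = dp[i] - cp[i] * out[i + 1]
    return out
```

For two ramps that differ only by a constant exposure offset, the cut-and-paste composite has a visible jump at the seam, whereas the blended scanline is very nearly a single ramp whose mean matches that of the cut-and-paste composite.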

In order to accelerate the solution of this sparse linear system, Fattal, Lischinski, and 
Werman (2002) use multigrid, whereas Agarwala, Dontcheva, Agrawala et al. (2004) use 
hierarchical basis preconditioned conjugate gradient descent (Szeliski 1990b, 2006b) (Appendix 
A.5). In subsequent work, Agarwala (2007) shows how using a quadtree representation 
for the solution can further accelerate the computation with minimal loss in accuracy, 
while Szeliski, Uyttendaele, and Steedly (2008) show how representing the per-image offset 
fields using even coarser splines is even faster. This latter work also argues that blending 
in the log domain, i.e., using multiplicative rather than additive offsets, is preferable, as it 
more closely matches texture contrasts across seam boundaries. The resulting seam blending 
works very well in practice (Figure 9.14h), although care must be taken when copying large 
gradient values near seams so that a “double edge” is not introduced. 

Copying gradients directly from the source images after seam placement is just one ap- 
proach to gradient domain blending. The paper by Levin, Zomet, Peleg et al. (2004) examines 
several different variants of this approach, which they call Gradient-domain Image STitching 
(GIST). The techniques they examine include feathering (blending) the gradients from the 
source images, as well as using an L1 norm in performing the reconstruction of the image 
from the gradient field, rather than using an L2 norm as in Equation (9.44). Their preferred 
technique is the L1 optimization of a feathered (blended) cost function on the original image 
gradients (which they call GIST1-l1). Since L1 optimization using linear programming can 
be slow, they develop a faster iterative median-based algorithm in a multigrid framework. 
Visual comparisons between their preferred approach and what they call optimal seam on 
the gradients (which is equivalent to the approach of Agarwala, Dontcheva, Agrawala et al. 
(2004)) show similar results, while significantly improving on pyramid blending and feather- 
ing algorithms. 

15 At seam locations, the right hand side is replaced by the average of the gradients in the two source images. 

Exposure compensation. Pyramid and gradient domain blending can do a good job of 
compensating for moderate amounts of exposure differences between images. However, 
when the exposure differences become large, alternative approaches may be necessary. 

Uyttendaele, Eden, and Szeliski (2001) iteratively estimate a local correction between 
each source image and a blended composite. First, a block-based quadratic transfer function is 
fit between each source image and an initial feathered composite. Next, transfer functions are 
averaged with their neighbors to get a smoother mapping and per-pixel transfer functions are 
computed by splining (interpolating) between neighboring block values. Once each source 
image has been smoothly adjusted, a new feathered composite is computed and the process is 
repeated (typically three times). The results shown by Uyttendaele, Eden, and Szeliski (2001) 
demonstrate that this does a better job of exposure compensation than simple feathering and 
can handle local variations in exposure due to effects such as lens vignetting. 

Ultimately, however, the most principled way to deal with exposure differences is to stitch 
images in the radiance domain, i.e., to convert each image into a radiance image using its 
exposure value and then create a stitched, high dynamic range image, as discussed in Sec- 
tion 10.2 (Eden, Uyttendaele, and Szeliski 2006). 


9.4 Additional reading 

The literature on image stitching dates back to work in the photogrammetry community in 
the 1970s (Milgram 1975, 1977; Slama 1980). In computer vision, papers started appearing 
in the early 1980s (Peleg 1981), while the development of fully automated techniques came 
about a decade later (Mann and Picard 1994; Chen 1995; Szeliski 1996; Szeliski and Shum 
1997; Sawhney and Kumar 1999; Shum and Szeliski 2000). Those techniques used direct 
pixel-based alignment but feature-based approaches are now the norm (Zoghlami, Faugeras, 
and Deriche 1997; Capel and Zisserman 1998; Cham and Cipolla 1998; Badra, Qumsieh, and 
Dudek 1998; McLauchlan and Jaenicke 2002; Brown and Lowe 2007). A collection of some 
of these papers can be found in the book by Benosman and Kang (2001). Szeliski (2006a) 
provides a comprehensive survey of image stitching, on which the material in this chapter is 
based. 

High-quality techniques for optimal seam finding and blending are another important 
component of image stitching systems. Important developments in this field include work by 
Milgram (1977), Burt and Adelson (1983b), Davis (1998), Uyttendaele, Eden, and Szeliski 
(2001), Perez, Gangnet, and Blake (2003), Levin, Zomet, Peleg et al. (2004), Agarwala, 
Dontcheva, Agrawala et al. (2004), Eden, Uyttendaele, and Szeliski (2006), and Kopf, Uyt- 
tendaele, Deussen et al. (2007). 

In addition to the merging of multiple overlapping photographs taken for aerial or terrestrial 
panoramic image creation, stitching techniques can be used for automated whiteboard 
scanning (He and Zhang 2005; Zhang and He 2007), scanning with a mouse (Nakao, 
Kashitani, and Kaneyoshi 1998), and retinal image mosaics (Can, Stewart, Roysam et al. 
2002). They can also be applied to video sequences (Teodosio and Bender 1993; Irani, Hsu, 
and Anandan 1995; Kumar, Anandan, Irani et al. 1995; Sawhney and Ayer 1996; Massey 
and Bender 1996; Irani and Anandan 1998; Sawhney, Arpa, Kumar et al. 2002; Agarwala, 
Zheng, Pal et al. 2005; Rav-Acha, Pritch, Lischinski et al. 2005; Steedly, Pal, and Szeliski 
2005; Baudisch, Tan, Steedly et al. 2006) and can even be used for video compression (Lee, 
Chen, Lin et al. 1997). 

9.5 Exercises 

Ex 9.1: Direct pixel-based alignment Take a pair of images, compute a coarse-to-fine affine 
alignment (Exercise 8.2) and then blend them using either averaging (Exercise 6.2) or a Lapla- 
cian pyramid (Exercise 3.20). Extend your motion model from affine to perspective (homog- 
raphy) to better deal with rotational mosaics and planar surfaces seen under arbitrary motion. 

Ex 9.2: Feature-based stitching Extend your feature-based alignment technique from Exercise 
6.2 to use a full perspective model and then blend the resulting mosaic using either 
averaging or more sophisticated distance-based feathering (Exercise 9.9). 

Ex 9.3: Cylindrical strip panoramas To generate cylindrical or spherical panoramas from 
a horizontally panning (rotating) camera, it is best to use a tripod. Set your camera up to take 
a series of 50% overlapped photos and then use the following steps to create your panorama: 

1. Estimate the amount of radial distortion by taking some pictures with lots of long 
straight lines near the edges of the image and then using the plumb-line method from 
Exercise 6.10. 

2. Compute the focal length either by using a ruler and paper, as in Figure 6.7 (Debevec, 
Wenger, Tchou et al. 2002) or by rotating your camera on the tripod, overlapping the 
images by exactly 0% and counting the number of images it takes to make a 360° 
panorama. 

3. Convert each of your images to cylindrical coordinates using (9.12-9.16). 

4. Line up the images with a translational motion model using either a direct pixel-based 
technique, such as coarse-to-fine incremental or an FFT, or a feature-based technique. 

5. (Optional) If doing a complete 360° panorama, align the first and last images. Compute 
the amount of accumulated vertical mis-registration and re-distribute this among the 
images. 
6. Blend the resulting images using feathering or some other technique. 
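
As a possible starting point for steps 2 and 3, the following Python sketch computes the focal length from the image count and maps a pixel to cylindrical coordinates. It assumes that N images with exactly 0% overlap span 360°, so that each image has a horizontal field of view of 360°/N, and that pixel coordinates are measured from the image center. The projection used is the usual cylindrical form x' = s tan⁻¹(x/f), y' = s y / √(x² + f²), which we assume corresponds to (9.12-9.16); the function names are illustrative.

```python
import math


def focal_length_from_image_count(image_width, num_images):
    """Focal length in pixels when num_images non-overlapping shots cover 360 degrees."""
    half_fov = math.pi / num_images          # half of the horizontal field of view
    return (image_width / 2.0) / math.tan(half_fov)


def to_cylindrical(x, y, focal, scale=None):
    """Map centered image coordinates (x, y) to cylindrical coordinates (x', y')."""
    s = focal if scale is None else scale    # s = f keeps the scale at the image center
    theta = math.atan2(x, focal)             # angle around the cylinder axis
    h = y / math.hypot(x, focal)             # height on the unit-radius cylinder
    return s * theta, s * h
```

For example, if four 640-pixel-wide images are needed to go all the way around, each covers 90° and the focal length is 320 pixels. Choosing s = f minimizes the distortion (scaling) near the center of each image.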

Ex 9.4: Coarse alignment Use FFT or phase correlation (Section 8.1.2) to estimate the 
initial alignment between successive images. How well does this work? Over what range of 
overlaps? If it does not work, does aligning sub-sections (e.g., quarters) do better? 

Ex 9.5: Automated mosaicing Use feature-based alignment with four-point RANSAC for 
homographies (Section 6.1.3, Equations (6.19-6.23)) or three-point RANSAC for rotational 
motions (Brown, Hartley, and Nister 2007) to match up all pairs of overlapping images. 

Merge these pairwise estimates together by finding a spanning tree of pairwise relations. 
Visualize the resulting global alignment, e.g., by displaying a blend of each image with all 
other images that overlap it. 

For greater robustness, try multiple spanning trees (perhaps randomly sampled based on 
the confidence in pairwise alignments) to see if you can recover from bad pairwise matches 
(Zach, Klopschitz, and Pollefeys 2010). As a measure of fitness, count how many pairwise 
estimates are consistent with the global alignment. 
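
The spanning tree construction and the fitness measure can be prototyped before any homographies are involved. The sketch below is a deliberately simplified illustration: pairwise motions are pure 2D translations rather than homographies or rotations, the tree is a maximum-confidence spanning tree built with Kruskal's algorithm, and the tuple layout of `matches` and the tolerance `tol` are assumptions made for this example.

```python
import math
from collections import deque


def find(parent, a):
    """Union-find root lookup with path halving."""
    while parent[a] != a:
        parent[a] = parent[parent[a]]
        a = parent[a]
    return a


def global_alignment(num_images, matches, tol=2.0):
    """matches: list of (i, j, confidence, (dx, dy)) with position[j] = position[i] + (dx, dy)."""
    parent = list(range(num_images))
    tree = {k: [] for k in range(num_images)}
    # Kruskal: add the most confident matches first, skipping those that close a cycle.
    for i, j, _, (dx, dy) in sorted(matches, key=lambda m: -m[2]):
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:
            parent[ri] = rj
            tree[i].append((j, dx, dy))
            tree[j].append((i, -dx, -dy))
    # Chain the pairwise motions along the tree, starting from image 0.
    positions = {0: (0.0, 0.0)}
    queue = deque([0])
    while queue:
        i = queue.popleft()
        for j, dx, dy in tree[i]:
            if j not in positions:
                positions[j] = (positions[i][0] + dx, positions[i][1] + dy)
                queue.append(j)
    # Fitness: number of pairwise estimates that agree with the global alignment.
    consistent = 0
    for i, j, _, (dx, dy) in matches:
        if i in positions and j in positions:
            ex = positions[j][0] - positions[i][0] - dx
            ey = positions[j][1] - positions[i][1] - dy
            if math.hypot(ex, ey) <= tol:
                consistent += 1
    return positions, consistent
```

To try multiple spanning trees, the sort by confidence can be replaced by a random ordering weighted by confidence, keeping the tree with the highest consistency count.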

Ex 9.6: Global optimization Use the initialization from the previous algorithm to perform 
a full bundle adjustment over all of the camera rotations and focal lengths, as described in 
Section 7.4 and by Shum and Szeliski (2000). Optionally, estimate radial distortion parame- 
ters as well or support fisheye lenses (Section 2.1.6). 

As in the previous exercise, visualize the quality of your registration by creating compos- 
ites of each input image with its neighbors, optionally blinking between the original image 
and the composite to better see mis-alignment artifacts. 

Ex 9.7: De-ghosting Use the results of the previous bundle adjustment to predict the loca- 
tion of each feature in a consensus geometry. Use the difference between the predicted and 
actual feature locations to correct for small mis-registrations, as described in Section 9.2.2 
(Shum and Szeliski 2000). 

Ex 9.8: Compositing surface Choose a compositing surface (Section 9.3.1), e.g., a single 
reference image extended to a larger plane, a sphere represented using cylindrical or spherical 
coordinates, a stereographic “little planet” projection, or a cube map. 

Project all of your images onto this surface and blend them with equal weighting, for now 
(just to see where the original image seams are). 

Ex 9.9: Feathering and blending Compute a feather (distance) map for each warped source 
image and use these maps to blend the warped images. 

Alternatively, use Laplacian pyramid blending (Exercise 3.20) or gradient domain blend- 
ing. 
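
A minimal sketch of the feathering step is given below. It assumes each warped image comes with a boolean validity mask of the same size, uses a two-pass city-block distance transform as a stand-in for a Euclidean one, and works on grayscale images stored as nested lists; `feather_map` and `feather_blend` are illustrative names.

```python
def feather_map(valid):
    """City-block distance from each valid pixel to the nearest invalid pixel or image border."""
    h, w = len(valid), len(valid[0])
    dist = [[0] * w for _ in range(h)]
    for y in range(h):                       # forward pass: propagate from above and left
        for x in range(w):
            if valid[y][x]:
                up = dist[y - 1][x] if y > 0 else 0
                left = dist[y][x - 1] if x > 0 else 0
                dist[y][x] = min(up, left) + 1
    for y in reversed(range(h)):             # backward pass: propagate from below and right
        for x in reversed(range(w)):
            if valid[y][x]:
                down = dist[y + 1][x] if y < h - 1 else 0
                right = dist[y][x + 1] if x < w - 1 else 0
                dist[y][x] = min(dist[y][x], down + 1, right + 1)
    return dist


def feather_blend(images, masks):
    """Per-pixel weighted average of the warped images, weighted by their feather maps."""
    weights = [feather_map(m) for m in masks]
    h, w = len(images[0]), len(images[0][0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = sum(wt[y][x] for wt in weights)
            if total > 0:                    # pixels covered by no image stay at 0
                out[y][x] = sum(wt[y][x] * img[y][x]
                                for wt, img in zip(weights, images)) / total
    return out
```

Raising the weights to a power before normalizing gives the sharper transitions mentioned in the discussion of feathering with a high exponent.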
Ex 9.10: Photomontage and object removal Implement a “PhotoMontage” system in which 
users can indicate desired or unwanted regions in pre-registered images using strokes or other 
primitives (such as bounding boxes). 

(Optional) Devise an automatic moving objects remover (or “keeper”) by analyzing which 
inconsistent regions are more or less typical given some consensus (e.g., median filtering) of 
the aligned images. Figure 9.17 shows an example where the moving object was kept. Try 
to make this work for sequences with large amounts of overlaps and consider averaging the 
images to make the moving object look more ghosted. 
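
For the optional part, one simple consensus is the per-pixel median of the aligned images. The sketch below assumes grayscale images stored as nested lists and a hand-picked `threshold` in gray levels; both the function name and the threshold value are illustrative choices, not part of the original system.

```python
from statistics import median


def moving_object_masks(images, threshold=20):
    """Return the per-pixel median image and, per input image, a mask of inconsistent pixels."""
    h, w = len(images[0]), len(images[0][0])
    # Consensus: the median across images suppresses objects present in a minority of shots.
    consensus = [[median(img[y][x] for img in images) for x in range(w)] for y in range(h)]
    # A pixel is flagged as "moving" in an image if it deviates strongly from the consensus.
    masks = [[[abs(img[y][x] - consensus[y][x]) > threshold for x in range(w)]
              for y in range(h)]
             for img in images]
    return consensus, masks
```

A "remover" can output the consensus image directly, while a "keeper" copies the flagged pixels of one chosen image on top of the consensus.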
Figure 10.1 Computational photography: (a) merging multiple exposures to create high 
dynamic range images (Debevec and Malik 1997) © 1997 ACM; (b) merging flash and non-flash 
photographs (Petschnigg, Agrawala, Hoppe et al. 2004) © 2004 ACM; (c) image matting 
and compositing (Chuang, Curless, Salesin et al. 2001) © 2001 IEEE; (d) hole filling 
with inpainting (Criminisi, Perez, and Toyama 2004) © 2004 IEEE. 
Stitching multiple images into wide field of view panoramas, which we covered in Chapter 9, 
allows us to create photographs that could not be captured with a regular camera. This is just 
one instance of computational photography, where image analysis and processing algorithms 
are applied to one or more photographs to create images that go beyond the capabilities of 
traditional imaging systems. Some of these techniques are now being incorporated directly 
into digital still cameras. For example, some of the newer digital still cameras have sweep 
panorama modes and take multiple shots in low-light conditions to reduce image noise. 

In this chapter, we cover a number of additional computational photography algorithms. 
We begin with a review of photometric image calibration (Section 10.1), i.e., the measurement 
of camera and lens responses, which is a prerequisite for many of the algorithms we describe 
later. We then discuss high dynamic range imaging (Section 10.2), which captures the full 
range of brightness in a scene through the use of multiple exposures (Figure 10.1a). We also 
discuss tone mapping operators, which map rich images back into regular display devices, 
such as screens and printers, as well as algorithms that merge flash and regular images to 
obtain better exposures (Figure 10.1b). 

Next, we discuss how the resolution of images can be improved either by merging mul- 
tiple photographs together or using sophisticated image priors (Section 10.3). This includes 
algorithms for extracting full-color images from the patterned Bayer mosaics present in most 
cameras. 

In Section 10.4, we discuss algorithms for cutting pieces of images from one photograph 
and pasting them into others (Figure 10.1c). In Section 10.5, we describe how to generate 
novel textures from real-world samples for applications such as filling holes in images (Figure 
10.1d). We close with a brief overview of non-photorealistic rendering (Section 10.5.2), 
which can turn regular photographs into artistic renderings that resemble traditional drawings 
and paintings. 

One topic that we do not cover extensively in this book is novel computational sensors, 
optics, and cameras. A nice survey can be found in an article by Nayar (2006), a recently 
published book by Raskar and Tumblin (2010), and more recent research papers (Levin, 
Fergus, Durand et al. 2007). Some related discussion can also be found in Sections 10.2 
and 13.3. 

A good general-audience introduction to computational photography can be found in the 
article by Hayes (2008) as well as survey papers by Nayar (2006), Cohen and Szeliski (2006), 
Levoy (2006), and Debevec (2006). 1 Raskar and Tumblin (2010) give extensive coverage of 
topics in this area, with particular emphasis on computational cameras and sensors. The 
sub-field of high dynamic range imaging has its own book discussing research in this area 
(Reinhard, Ward, Pattanaik et al. 2005), as well as a wonderful book aimed more at professional 
photographers (Freeman 2008). 2 A good survey of image matting is provided by Wang 
and Cohen (2007a). 

1 See also the two special issue journals edited by Bimber (2006) and Durand and Szeliski (2007). 

There are also several courses on computational photography where the instructors have 
provided extensive on-line materials, e.g., Fredo Durand’s Computational Photography course 
at MIT, 3 Alyosha Efros’ class at Carnegie Mellon, 4 Marc Levoy’s class at Stanford, 5 and a 
series of SIGGRAPH courses on Computational Photography. 6 


10.1 Photometric calibration 

Before we can successfully merge multiple photographs, we need to characterize the func- 
tions that map incoming irradiance into pixel values and also the amounts of noise present 
in each image. In this section, we examine three components of the imaging pipeline (Fig- 
ure 10.2) that affect this mapping. 

The first is the radiometric response function (Mitsunaga and Nayar 1999), which maps 
photons arriving at the lens into digital values stored in the image file (Section 10.1.1). The 
second is vignetting, which darkens pixel values near the periphery of images, especially at 
large apertures (Section 10.1.3). The third is the point spread function, which characterizes 
the blur induced by the lens, anti-aliasing filters, and finite sensor areas (Section 10.1.4). 7 The 
material in this section builds on the image formation processes described in Sections 2.2.3 
and 2.3.3, so if it has been a while since you looked at those sections, please go back and 
review them. 


10.1.1 Radiometric response function 

As we can see in Figure 10.2, a number of factors affect how the intensity of light arriving 
at the lens ends up being mapped into stored digital values. Let us ignore for now any non- 
uniform attenuation that may occur inside the lens, which we cover in Section 10.1.3. 

The first factors to affect this mapping are the aperture and shutter speed (Section 2.3), 
which can be modeled as global multipliers on the incoming light, most conveniently measured 
in exposure values (log2 brightness ratios). Next, the analog to digital (A/D) converter 
on the sensing chip applies an electronic gain, usually controlled by the ISO setting on your 
camera. While in theory this gain is linear, as with any electronics, non-linearities may be 
present (either unintentionally or by design). Ignoring, for now, photon noise, on-chip noise, 
amplifier noise, and quantization noise, which we discuss shortly, you can often assume that 
the mapping between incoming light and the values stored in a RAW camera file (if your 
camera supports this) is roughly linear. 

2 Gulbins and Gulbins (2009) discuss related photographic techniques. 

3 MIT 6.815/6.865, http://stellar.mit.edu/S/course/6/sp08/6.815/materials.html. 

4 CMU 15-463, http://graphics.cs.cmu.edu/courses/15-463/. 

5 Stanford CS 448A, http://graphics.stanford.edu/courses/cs448a-10/. 

6 http://web.media.mit.edu/~raskar/photo/. 

7 Additional photometric camera and lens effects include sensor glare, blooming, and chromatic aberration, which 
can also be thought of as a spectrally varying form of geometric aberration (Section 2.2.3). 

Figure 10.2 Image sensing pipeline: (a) block diagram showing the various sources of noise 
as well as the typical digital post-processing steps; (b) equivalent signal transforms, including 
convolution, gain, and noise injection. The abbreviations are: RD = radial distortion, AA = 
anti-aliasing filter, CFA = color filter array, Q1 and Q2 = quantization noise. 

Figure 10.3 Radiometric response calibration: (a) typical camera response function, showing 
the mapping between incoming log irradiance (exposure) and output eight-bit pixel values, 
for one color channel (Debevec and Malik 1997) © 1997 ACM; (b) color checker chart. 

If images are being stored in the more common JPEG format, the camera’s digital signal 
processor (DSP) next performs Bayer pattern demosaicing (Sections 2.3.2 and 10.3.1), which 
is a mostly linear (but often non-stationary) process. Some sharpening is also often applied at 
this stage. Next, the color values are multiplied by different constants (or sometimes a 3 × 3 
color twist matrix) to perform color balancing, i.e., to move the white point closer to pure 
white. Finally, a standard gamma is applied to the intensities in each color channel and the 
colors are converted into YCbCr format before being transformed by a DCT, quantized, and 
then compressed into the JPEG format (Section 2.3.3). Figure 10.2 shows all of these steps 
in pictorial form. 

Given the complexity of all of this processing, it is difficult to model the camera response 
function (Figure 10.3a), i.e., the mapping between incoming irradiance and digital RGB val- 
ues, from first principles. A more practical approach is to calibrate the camera by measuring 
correspondences between incoming light and final values. 

The most accurate, but most expensive, approach is to use an integrating sphere, which is 
a large (typically 1 m diameter) sphere carefully painted on the inside with white matte paint. 
An accurately calibrated light at the top controls the amount of radiance inside the sphere 
(which is constant everywhere because of the sphere’s radiometry) and a small opening at the 
side allows for a camera/lens combination to be mounted. By slowly varying the current going 
into the light, an accurate correspondence can be established between incoming radiance and 
measured pixel values. The vignetting and noise characteristics of the camera can also be 
simultaneously determined. 

A more practical alternative is to use a calibration chart (Figure 10.3b) such as the Macbeth 
or Munsell ColorChecker Chart. 8 The biggest problem with this approach is to ensure 
uniform lighting. One approach is to use a large dark room with a high-quality light source 
far away from (and perpendicular to) the chart. Another is to place the chart outdoors away 
from any shadows. (The results will differ under these two conditions, because the color of 
the illuminant will be different.) 

The easiest approach is probably to take multiple exposures of the same scene while the 
camera is on a tripod and to recover the response function by simultaneously estimating the 
incoming irradiance at each pixel and the response curve (Mann and Picard 1995; Debevec 
and Malik 1997; Mitsunaga and Nayar 1999). This approach is discussed in more detail in 
Section 10.2 on high dynamic range imaging. 

If all else fails, i.e., you just have one or more unrelated photos, you can use an International 
Color Consortium (ICC) profile for the camera (Fairchild 2005). 9 Even more simply, 
you can just assume that the response is linear if they are RAW files and that the images have 
a γ = 2.2 non-linearity (plus clipping) applied to each RGB channel if they are JPEG images. 
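
This fallback assumption is easy to write down. The sketch below assumes eight-bit values, a pure power-law response with γ = 2.2 for JPEG images, and a linear response for RAW values; the division by exposure time in `relative_radiance` is an added illustration of how linearized values from differently exposed shots can be compared, and all function names are hypothetical.

```python
def linearize(value, is_raw=False, gamma=2.2, max_value=255.0):
    """Map a stored pixel value to an approximately linear intensity in [0, 1]."""
    normalized = value / max_value
    return normalized if is_raw else normalized ** gamma


def is_clipped(value, max_value=255.0):
    """Saturated pixels carry no usable information about the true irradiance."""
    return value >= max_value


def relative_radiance(value, exposure_time, **kwargs):
    """Divide the linearized value by the exposure time to compare differently exposed shots."""
    return linearize(value, **kwargs) / exposure_time
```

Under this model, a scene point photographed at exposure times t and 2t produces different JPEG values but the same relative radiance, as long as neither value is clipped.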

10.1.2 Noise level estimation 

In addition to knowing the camera response function, it is also often important to know the 
amount of noise being injected under a particular camera setting (e.g., ISO/gain level). The 
simplest characterization of noise is a single standard deviation, usually measured in gray 
levels, independent of pixel value. A more accurate model can be obtained by estimating 
the noise level as a function of pixel value (Figure 10.4), which is known as the noise level 
function (Liu, Szeliski, Kang et al. 2008). 

As with the camera response function, the simplest way to estimate these quantities is in 
the lab, using either an integrating sphere or a calibration chart. The noise can be estimated 
either at each pixel independently, by taking repeated exposures and computing the temporal 
variance in the measurements (Healey and Kondepudy 1994), or over regions, by assuming 
that pixel values should all be the same within some region (e.g., inside a color checker 
square) and computing a spatial variance. 
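
Both estimates reduce to a mean and a standard deviation. The following sketch uses Python's `statistics` module and assumes that each exposure is given as a flat list of pixel values from a static scene and that a region is a list of values from a patch of nominally constant intensity; the function names are illustrative.

```python
from statistics import mean, stdev


def temporal_noise(exposures):
    """Per-pixel mean and sample standard deviation over repeated exposures."""
    num_pixels = len(exposures[0])
    stacks = [[exposure[p] for exposure in exposures] for p in range(num_pixels)]
    return [mean(s) for s in stacks], [stdev(s) for s in stacks]


def spatial_noise(region):
    """Mean and sample standard deviation inside a region assumed to be constant."""
    return mean(region), stdev(region)
```

Each (mean, standard deviation) pair is one sample of the noise level as a function of pixel value, so collecting such pairs over many pixels or color checker squares gives the raw data for a noise level function.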

This approach can be generalized to photos where there are regions of constant or slowly 
varying intensity (Liu, Szeliski, Kang et al. 2008). First, segment the image into such regions 
and fit a constant or linear function inside each region. Next, measure the (spatial) standard 
deviation of the differences between the noisy input pixels and the smooth fitted function 
away from large gradients and region boundaries. Plot these as a function of output level for 
each color channel, as shown in Figure 10.4. Finally, fit a lower envelope to this distribution 
in order to ignore pixels or deviations that are outliers. A fully Bayesian approach to this 
problem that models the statistical distribution of each quantity is presented by Liu, Szeliski, 
Kang et al. (2008). A simpler approach, which should produce useful results in most cases, 
is to fit a low-dimensional function (e.g., positive valued B-spline) to the lower envelope (see 
Exercise 10.2). 

8 http://www.xrite.com. 

9 See also the ICC Information on Profiles, http://www.color.org/info_profiles2.xalter. 

Figure 10.4 Noise level function estimates obtained from a single color photograph (Liu, 
Szeliski, Kang et al. 2008) © 2008 IEEE. The colored curves are the estimated NLF fit as the 
probabilistic lower envelope of the measured deviations between the noisy piecewise-smooth 
images. The ground truth NLFs obtained by averaging 29 images are shown in gray. 

In more recent work, Matsushita and Lin (2007) present a technique for simultaneously 
estimating a camera’s response and noise level functions based on skew (asymmetries) in 
level-dependent noise distributions. Their paper also contains extensive references to previous 
work in these areas. 


10.1.3 Vignetting 

A common problem with using wide-angle and wide-aperture lenses is that the image tends 
to darken in the corners (Figure 10.5a). This problem is generally known as vignetting and 
comes in several different forms, including natural, optical, and mechanical vignetting (Sec- 
tion 2.2.3) (Ray 2002). As with radiometric response function calibration, the most accurate 
way to calibrate vignetting is to use an integrating sphere or a picture of a uniformly colored 
and illuminated blank wall. 

An alternative approach is to stitch a panoramic scene and to assume that the true radiance 
at each pixel comes from the central portion of each input image. This is easier to do if 
the radiometric response function is already known (e.g., by shooting in RAW mode) and 
if the exposure is kept constant. If the response function, image exposures, and vignetting 
function are unknown, they can still be recovered by optimizing a large least squares fitting 
problem (Litvinov and Schechner 2005; Goldman 2011). Figure 10.6 shows an example of 
simultaneously estimating the vignetting, exposure, and radiometric response function from 
a set of overlapping photographs (Goldman 2011). Note that unless vignetting is modeled 
and compensated, regular gradient-domain image blending (Section 9.3.4) will not create an 
attractive image. 

Figure 10.5 Single image vignetting correction (Zheng, Yu, Kang et al. 2008) © 2008 
IEEE: (a) original image with strong visible vignetting; (b) vignetting compensation as described 
by Zheng, Lin, and Kang (2006); (c-d) vignetting compensation as described 
by Zheng, Yu, Kang et al. (2008). 

Figure 10.6 Simultaneous estimation of vignetting, exposure, and radiometric response 
(Goldman 2011) © 2011 IEEE: (a) original average of the input images; (b) after compensating 
for vignetting; (c) using gradient domain blending only (note the remaining mottled 
look); (d) after both vignetting compensation and blending. 

If only a single image is available, vignetting can be estimated by looking for slow con- 
sistent intensity variations in the radial direction. The original algorithm proposed by Zheng, 
Lin, and Kang (2006) first pre-segmented the image into smoothly varying regions and then 
performed an analysis inside each region. Instead of pre-segmenting the image, Zheng, Yu, 
Kang et al. (2008) compute the radial gradients at all the pixels and use the asymmetry in 
this distribution (since gradients away from the center are, on average, slightly negative) to 
estimate the vignetting. Figure 10.5 shows the results of applying each of these algorithms 
to an image with a large amount of vignetting. Exercise 10.3 has you implement some of the 
above techniques. 


10.1.4 Optical blur (spatial response) estimation 

One final characteristic of imaging systems that you should calibrate is the spatial response 
function, which encodes the optical blur that gets convolved with the incoming image to produce 
the point-sampled image. The shape of the convolution kernel, which is also known as the 
point spread function (PSF) and whose Fourier transform is the optical transfer function, 
depends on several factors, including 
lens blur and radial distortion (Section 2.2.3), anti-aliasing filters in front of the sensor, and 
the shape and extent of each active pixel area (Section 2.3) (Figure 10.2). A good estimate of 
this function is required for applications such as multi-image super-resolution and de-blurring 
(Section 10.3). 

In theory, one could estimate the PSF by simply observing an infinitely small point light 
source everywhere in the image. Creating an array of samples by drilling through a dark plate 
and backlighting with a very bright light source is difficult in practice. 

A more practical approach is to observe an image composed of long straight lines or 
bars, since these can be fitted to arbitrary precision. Because the location of a horizontal 
or vertical edge can be aliased during acquisition, slightly slanted edges are preferred. The 
profile and locations of such edges can be estimated to sub-pixel precision, which makes it 
possible to estimate the PSF at sub-pixel resolutions (Reichenbach, Park, and Narayanswamy 
1991; Burns and Williams 1999; Williams and Burns 2001; Goesele, Fuchs, and Seidel 2003). 
The thesis by Murphy (2005) contains a nice survey of all aspects of camera calibration, 
including the spatial frequency response (SFR), spatial uniformity, tone reproduction, color 
reproduction, noise, dynamic range, color channel registration, and depth of field. It also 
includes a description of a slant-edge calibration algorithm called sfrmat2. 

The slant-edge technique can be used to recover a 1D projection of the 2D PSF, e.g., 
slightly vertical edges are used to recover the horizontal line spread function (LSF) (Williams 
1999). The LSF is then often converted into the Fourier domain and its magnitude plotted as a 
one-dimensional modulation transfer function (MTF), which indicates which image frequen- 
cies are lost (blurred) and aliased during the acquisition process (Section 2.3.1). For most 
computational photography applications, it is preferable to directly estimate the full 2D PSF, 
since it can be hard to recover from its projections (Williams 1999). 
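
The conversion from an edge profile to an LSF and an MTF can be sketched as follows. This is a minimal illustration, not the sfrmat2 algorithm: it assumes the super-sampled edge spread function (ESF) has already been extracted perpendicular to the slanted edge, and the function name `esf_to_mtf`, the Hann taper, and the synthetic Gaussian-blurred edge are all assumptions made for the example.

```python
import math
import numpy as np

def esf_to_mtf(esf, dx=1.0):
    """Convert a 1D edge spread function (ESF) into an LSF and an MTF.

    esf : samples of the edge profile taken perpendicular to the edge
    dx  : sample spacing in pixels (below 1 for a super-sampled slanted edge)
    """
    esf = np.asarray(esf, dtype=float)
    lsf = np.gradient(esf, dx)                 # line spread function = d(ESF)/dx
    lsf = lsf * np.hanning(lsf.size)           # taper to limit truncation leakage
    mtf = np.abs(np.fft.rfft(lsf))             # magnitude of the Fourier transform
    mtf = mtf / mtf[0]                         # normalize so that MTF(0) = 1
    freqs = np.fft.rfftfreq(lsf.size, d=dx)    # frequencies in cycles per pixel
    return lsf, freqs, mtf

# Synthetic test edge: an ideal step blurred by a Gaussian (sigma in pixels),
# sampled four times per pixel.
sigma, dx = 1.0, 0.25
x = np.arange(-16.0, 16.0, dx)
esf = np.array([0.5 * (1.0 + math.erf(v / (sigma * math.sqrt(2.0)))) for v in x])
lsf, freqs, mtf = esf_to_mtf(esf, dx)
```

For a Gaussian blur, the recovered MTF should approximately follow exp(-2 pi^2 sigma^2 f^2), which gives a simple sanity check on the pipeline.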

Figure 10.7 Calibration pattern with edges equally distributed at all orientations that can be 
used for PSF and radial distortion estimation (Joshi, Szeliski, and Kriegman 2008) © 2008 
IEEE. A portion of an actual sensed image is shown in the middle and a close-up of the ideal 
pattern is on the right. 

Figure 10.7 shows a pattern containing edges at all orientations, which can be used to 
directly recover a two-dimensional PSF. First, corners in the pattern are located by extracting 
edges in the sensed image, linking them, and finding the intersections of the circular arcs. 
Next, the ideal pattern, whose analytic form is known, is warped (using a homography) to 
fit the central portion of the input image and its intensities are adjusted to fit the ones in 
the sensed image. If desired, the pattern can be rendered at a higher resolution than the input 
image, which enables the estimation of the PSF to sub-pixel resolution (Figure 10.8a). Finally, 
a large linear least squares system is solved to recover the unknown PSF kernel K, 

    K = \arg\min_K \| B - D(I * K) \|^2,    (10.1) 

where B is the sensed (blurred) image, I is the predicted (sharp) image, and D is an optional 
downsampling operator that matches the resolution of the ideal and sensed images (Joshi, 
Szeliski, and Kriegman 2008). In terms of the notation (3.75) introduced in Section 3.4.3, 
this could also be written as 


    b = \arg\min_b \| o - D(s * b) \|^2,    (10.2) 

where o is the observed image, s is the sharp image, and b is the blur kernel. 
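
The least squares problem above is linear in the kernel entries, so it can be set up by stacking one equation per sensed pixel. The following is a minimal sketch under stated assumptions: the downsampling operator D is taken to be the identity, the images are single-channel, and the function name `estimate_psf` and the synthetic test kernel are hypothetical.

```python
import numpy as np

def estimate_psf(sharp, blurred, ksize):
    """Least squares estimate of a ksize x ksize kernel K with blurred ~ sharp * K.

    No downsampling is modeled (D = identity). Only pixels whose full
    neighborhood lies inside the image contribute equations.
    """
    r = ksize // 2
    h, w = sharp.shape
    cols = []
    # Each kernel tap (dy, dx) contributes one column: the sharp image shifted
    # so that it lines up with the blurred pixel it influences (correlation form).
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            cols.append(sharp[r + dy:h - r + dy, r + dx:w - r + dx].ravel())
    A = np.stack(cols, axis=1)
    b = blurred[r:h - r, r:w - r].ravel()
    k = np.linalg.lstsq(A, b, rcond=None)[0]
    # Flip both axes to turn the correlation weights into a convolution kernel.
    return k.reshape(ksize, ksize)[::-1, ::-1]

# Synthetic check: blur a random "ideal" image with a known asymmetric kernel.
rng = np.random.default_rng(0)
sharp = rng.random((24, 24))
K_true = np.array([[0.00, 0.10, 0.00],
                   [0.20, 0.40, 0.10],
                   [0.00, 0.15, 0.05]])
blurred = np.zeros_like(sharp)
for u in range(-1, 2):
    for v in range(-1, 2):
        # np.roll implements a (circular) convolution; interior pixels are exact.
        blurred += K_true[u + 1, v + 1] * np.roll(np.roll(sharp, u, axis=0), v, axis=1)
K_est = estimate_psf(sharp, blurred, 3)
```

With noise-free data the kernel is recovered essentially exactly; with real images, the system is usually regularized (e.g., with a smoothness or non-negativity prior), which this sketch omits.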

If the process of estimating the PSF is done locally in overlapping patches of the image, 
it can also be used to estimate the radial distortion and chromatic aberration induced by the 
lens (Figure 10.8b). Because the homography mapping the ideal target to the sensed image 
is estimated in the central (undistorted) part of the image, any (per-channel) shifts induced 
by the optics manifest themselves as a displacement in the PSF centers. 10 Compensating 
for these shifts eliminates both the achromatic radial distortion and the inter-channel shifts 
that result in visible chromatic aberration. The color-dependent blurring caused by chromatic 
aberration (Figure 2.21) can also be removed using the de-blurring techniques discussed in 
Section 10.3. Figure 10.8b shows how the radial distortion and chromatic aberration manifest 
themselves as elongated and displaced PSFs, along with the result of removing these effects 
in a region of the calibration target. 

10 This process confounds the distinction between geometric and photometric calibration. In principle, any 
geometric distortion could be modeled by spatially varying displaced PSFs. In practice, it is easier to fold any large 
shifts into the geometric correction component. 

Figure 10.8 Point spread function estimation using a calibration target (Joshi, Szeliski, and 
Kriegman 2008) © 2008 IEEE. (a) Sub-pixel PSFs at successively higher resolutions (note 
the interaction between the square sensing area and the circular lens blur). (b) The radial 
distortion and chromatic aberration can also be estimated and removed. (c) PSF for a 
misfocused (blurred) lens showing some diffraction and vignetting effects in the corners. 

The local 2D PSF estimation technique can also be used to estimate vignetting. Fig- 
ure 10.8c shows how the mechanical vignetting manifests itself as clipping of the PSF in the 
corners of the image. In order for the overall dimming associated with vignetting to be prop- 
erly captured, the modified intensities of the ideal pattern need to be extrapolated from the 
center, which is best done with a uniformly illuminated target. 

When working with RAW Bayer-pattern images, the correct way to estimate the PSF is 
to only evaluate the least squares terms in (10.1) at sensed pixel values, while interpolating 
the ideal image to all values. For JPEG images, you should linearize your intensities first, 
e.g., remove the gamma and any other non-linearities in your estimated radiometric response 
function. 

What if you have an image that was taken with an uncalibrated camera? Can you still 
recover the PSF and use it to correct the image? In fact, with a slight modification, the previous 
algorithms still work. 

Instead of assuming a known calibration image, you can detect strong elongated edges 
and fit ideal step edges in such regions (Figure 10.9b), resulting in the sharp image shown 
in Figure 10.9d. For every pixel that is surrounded by a complete set of valid estimated 
neighbors (green pixels in Figure 10.9c), apply the least squares formula (10.1) to estimate 
the kernel K. The resulting locally estimated PSFs can be used to correct for chromatic 
aberration (since the relative displacements between per-channel PSFs can be computed), as 
shown by Joshi, Szeliski, and Kriegman (2008). 

Figure 10.9 Estimating the PSF without using a calibration pattern (Joshi, Szeliski, and 
Kriegman 2008) © 2008 IEEE: (a) Input image with blue cross-section (profile) location. (b) 
Profile of sensed and predicted step edges. (c-d) Locations and values of the predicted colors 
near the edge locations. 

Exercise 10.4 provides some more detailed instructions for implementing and testing 
edge-based PSF estimation algorithms. An alternative approach, which does not require the 
explicit detection of edges but uses image statistics (gradient distributions) instead, is 
presented by Fergus, Singh, Hertzmann et al. (2006). 


10.2 High dynamic range imaging 

As we mentioned earlier in this chapter, registered images taken at different exposures can be 
used to calibrate the radiometric response function of a camera. More importantly, they can 
help you create well-exposed photographs under challenging conditions, such as brightly lit 
scenes where any single exposure contains saturated (overexposed) and dark (underexposed) 
regions (Figure 10.10). This problem is quite common, because the natural world contains a 
range of radiance values that is far greater than can be captured with any photographic sensor 
or film (Figure 10.11). Taking a set of bracketed exposures (exposures taken by a camera 
in automatic exposure bracketing (AEB) mode to deliberately under- and over-expose the 
image) gives you the material from which to create a properly exposed photograph, as shown 
in Figure 10.12 (Reinhard, Ward, Pattanaik et al. 2005; Freeman 2008; Gulbins and Gulbins 
2009; Hasinoff, Durand, and Freeman 2010). 

Figure 10.10 Sample indoor image where the areas outside the window are overexposed 
and inside the room are too dark. 

[Panel labels: 1; 1,500; 25,000; 400,000] 

Figure 10.11 Relative brightness of different scenes, ranging from 1 inside a dark room lit 
by a monitor to 2,000,000 looking at the sun. Photos courtesy of Paul Debevec. 

Figure 10.12 A bracketed set of shots (using the camera’s automatic exposure bracketing 
(AEB) mode) and the resulting high dynamic range (HDR) composite. 

While it is possible to combine pixels from different exposures directly into a final 
composite (Burt and Kolczynski 1993; Mertens, Kautz, and Reeth 2007), this approach runs the 
risk of creating contrast reversals and halos. Instead, the more common approach is to 
proceed in three stages: 

1. Estimate the radiometric response function from the aligned images. 

2. Estimate a radiance map by selecting or blending pixels from different exposures. 

3. Tone map the resulting high dynamic range (HDR) image back into a displayable 
gamut. 

The idea behind estimating the radiometric response function is relatively straightforward 
(Mann and Picard 1995; Debevec and Malik 1997; Mitsunaga and Nayar 1999; Reinhard, 
Ward, Pattanaik et al. 2005). Suppose you take three sets of images at different exposures 
(shutter speeds), say at ±2 exposure values. 11 If we were able to determine the irradiance 
(exposure) E_i at each pixel (2.101), we could plot it against the measured pixel value z_{ij} for 
each exposure time t_j, as shown in Figure 10.13. 

Unfortunately, we do not know the irradiance values E_i, so these have to be estimated 
at the same time as the radiometric response function f, which can be written (Debevec and 
Malik 1997) as 

    z_{ij} = f(E_i t_j),    (10.3) 

where t_j is the exposure time for the jth image. The inverse response curve f^{-1} is given by 

    f^{-1}(z_{ij}) = E_i t_j.    (10.4) 

Taking logarithms of both sides (base 2 is convenient, as we can now measure quantities in 
EVs), we obtain 

    g(z_{ij}) = \log f^{-1}(z_{ij}) = \log E_i + \log t_j,    (10.5) 

where g = \log f^{-1} (which maps pixel values z_{ij} into log irradiance) is the curve we are 
estimating (Figure 10.13 turned on its side). 

Debevec and Malik (1997) assume that the exposure times t_j are known. (Recall that 
these can be obtained from a camera’s EXIF tags, but that they actually follow a power of 2 
progression ..., 1/64, 1/32, 1/16, 1/8, ... instead of the marked ..., 1/125, 1/60, 1/30, 
1/15, 1/8, ... values; see Exercise 2.5.) The unknowns are therefore the per-pixel exposures 
E_i and the response values g_k = g(k), where g can be discretized according to the 256 
pixel values commonly observed in eight-bit images. (The response curves are calibrated 
separately for each color channel.) 

11 Changing the shutter speed is preferable to changing the aperture, as the latter can modify the vignetting and 
focus. Using ±2 “f-stops” (technically, exposure values, or EVs, since f-stops refer to apertures) is usually the right 
compromise between capturing a good dynamic range and having properly exposed pixels everywhere. 

Figure 10.13 Radiometric calibration using multiple exposures (Debevec and Malik 1997). 
Corresponding pixel values are plotted as functions of log exposures (irradiance). The curves 
on the left are shifted to account for each pixel’s unknown radiance until they all line up into 
a single smooth curve. 


In order to make the response curve smooth, Debevec and Malik (1997) add a second-order 
smoothness constraint 

    \lambda \sum_k g''(k)^2 = \lambda \sum_k [g(k - 1) - 2 g(k) + g(k + 1)]^2,    (10.6) 

which is similar to the one used in snakes (5.3). Since pixel values are more reliable in the 
middle of their range (and the g function becomes singular near saturation values), they also 
add a weighting (hat) function w(k) that decays to zero at both ends of the pixel value range, 

    w(z) = \begin{cases} z - z_{\min} & z \le (z_{\min} + z_{\max})/2 \\ z_{\max} - z & z > (z_{\min} + z_{\max})/2. \end{cases}    (10.7) 


Putting all of these terms together, they obtain a least squares problem in the unknowns 
{g_k} and {E_i}, 

    E = \sum_i \sum_j w(z_{ij}) [g(z_{ij}) - \log E_i - \log t_j]^2 + \lambda \sum_k w(k) g''(k)^2.    (10.8) 

(In order to remove the overall shift ambiguity in the response curve and irradiance values, 
the middle of the response curve is set to 0.) Debevec and Malik (1997) show how this can 
be implemented in 21 lines of MATLAB code, which partially accounts for the popularity of 
their technique. 
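
The least squares problem (10.8) can be sketched in a few lines of NumPy. This is not a transcription of the original MATLAB code: the function names `hat` and `solve_response`, the dense solver, the use of natural logarithms, and the synthetic gamma-curve camera used for the check are all assumptions of this example. Each row is scaled by the square root of its weight so that the squared residuals carry the weights exactly as written in (10.8).

```python
import numpy as np

def hat(z, z_min=0, z_max=255):
    """Hat weighting (10.7): largest at mid-range, zero at both extremes."""
    z = np.asarray(z, dtype=float)
    return np.where(z <= 0.5 * (z_min + z_max), z - z_min, z_max - z)

def solve_response(Z, log_t, lam=1.0, n=256):
    """Minimize (10.8) over g(0..n-1) and log E_i.

    Z     : (num_pixels, num_exposures) integer pixel values z_ij
    log_t : (num_exposures,) log exposure times
    """
    P, J = Z.shape
    A = np.zeros((P * J + 1 + (n - 2), n + P))
    b = np.zeros(A.shape[0])
    row = 0
    for i in range(P):                        # data terms
        for j in range(J):
            sw = np.sqrt(hat(Z[i, j], 0, n - 1))
            A[row, Z[i, j]] = sw              # sqrt(w) * g(z_ij)
            A[row, n + i] = -sw               # -sqrt(w) * log E_i
            b[row] = sw * log_t[j]
            row += 1
    A[row, n // 2] = 1.0                      # remove the shift ambiguity: g(mid) = 0
    row += 1
    for k in range(1, n - 1):                 # second-order smoothness terms
        sw = np.sqrt(lam * hat(k, 0, n - 1))
        A[row, k - 1:k + 2] = sw * np.array([1.0, -2.0, 1.0])
        row += 1
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    return x[:n], x[n:]                       # g, log E

# Synthetic camera with a gamma-2.2 response, so that g(z) = 2.2 log(z / 255).
rng = np.random.default_rng(0)
log_E_true = rng.uniform(-3.0, 0.0, size=60)
log_t = np.log([0.25, 0.5, 1.0, 2.0, 4.0])
X = np.exp(log_E_true[:, None] + log_t[None, :])
Z = np.clip(np.round(255.0 * X ** (1.0 / 2.2)), 0, 255).astype(int)
g, log_E = solve_response(Z, log_t, lam=1.0)
```

In practice, only a sparse subset of pixels is fed to the solver, since the number of unknowns grows with the number of pixels used.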

While Debevec and Malik (1997) assume that the exposure times t_j are known exactly, 
there is no reason why these additional variables cannot be thrown into the least squares 
problem, constraining their final estimated values to lie close to their nominal values \hat{t}_j with 
an extra term \eta \sum_j (t_j - \hat{t}_j)^2. 

Figure 10.14 Recovered response function and radiance image for a real digital camera 
(DCS460) (Debevec and Malik 1997) © 1997 ACM. 


Figure 10.14 shows the recovered radiometric response function for a digital camera along 
with select (relative) radiance values in the overall radiance map. Figure 10.15 shows the 
bracketed input images captured on color film and the corresponding radiance map. 

While Debevec and Malik (1997) use a general second-order smooth curve g to parame- 
terize their response curve, Mann and Picard (1995) use a three-parameter function 

    f(E) = \alpha + \beta E^{\gamma},    (10.9) 

while Mitsunaga and Nayar (1999) use a low-order (N < 10) polynomial for the inverse 
response function g. Pal, Szeliski, Uyttendaele et al. (2004) derive a Bayesian model that 
estimates an independent smooth response function for each image, which can better model 
the more sophisticated (and hence less predictable) automatic contrast and tone adjustment 
performed in today’s digital cameras. 

Once the response function has been estimated, the second step in creating high dynamic 
range photographs is to merge the input images into a composite radiance map. If the re- 
sponse function and images were known exactly, i.e., if they were noise free, you could use 
any non-saturated pixel value to estimate the corresponding radiance by mapping it through 
the inverse response curve E = g(z). 

Figure 10.15 Bracketed set of exposures captured with a film camera and the resulting 
radiance image displayed in pseudocolor (Debevec and Malik 1997) © 1997 ACM. 

Figure 10.16 Merging multiple exposures to create a high dynamic range composite (Kang, 
Uyttendaele, Winder et al. 2003): (a-c) three different exposures; (d) merging the exposures 
using classic algorithms (note the ghosting due to the horse’s head movement); (e) merging 
the exposures with motion compensation. 

Figure 10.17 HDR merging with large amounts of motion (Eden, Uyttendaele, and Szeliski 
2006) © 2006 IEEE: (a) registered bracketed input images; (b) results after the first pass of 
image selection: reference labels, image, and tone-mapped image; (c) results after the second 
pass of image selection: final labels, compressed HDR image, and tone-mapped image. 

Unfortunately, pixels are noisy, especially under low-light conditions when fewer photons 
arrive at the sensor. To compensate for this, Mann and Picard (1995) use the derivative of 
the response function as a weight in determining the final radiance estimate, since “flatter” 
regions of the curve tell us less about the incoming irradiance. Debevec and Malik (1997) 
use a hat function (10.7) which accentuates mid-tone pixels while avoiding saturated values. 
Mitsunaga and Nayar (1999) show that in order to maximize the signal-to-noise ratio (SNR), 
the weighting function must emphasize both higher pixel values and larger gradients in the 
transfer function, i.e., 

    w(z) = g(z)/g'(z),    (10.10) 

where the weights w are used to form the final irradiance estimate 

    \log E_i = \frac{\sum_j w(z_{ij}) [g(z_{ij}) - \log t_j]}{\sum_j w(z_{ij})}.    (10.11) 


Exercise 10.1 has you implement one of the radiometric response function calibration tech- 
niques and then use it to create radiance maps. 
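
The weighted merge (10.11) is a per-pixel weighted average and can be sketched as below. The function name `merge_log_radiance`, the table-based representation of g and w, and the linear-response test camera are assumptions of this example; any of the weighting functions discussed above could be passed in as the `weight` table.

```python
import numpy as np

def merge_log_radiance(Z, g, log_t, weight):
    """Evaluate (10.11): weighted average of g(z_ij) - log t_j over exposures.

    Z      : (num_exposures, H, W) integer images z_ij
    g      : (256,) table with g[z] = log f^{-1}(z)
    log_t  : (num_exposures,) log exposure times
    weight : (256,) table of per-pixel-value weights w(z)
    """
    w = weight[Z]                                          # (J, H, W)
    num = np.sum(w * (g[Z] - log_t[:, None, None]), axis=0)
    den = np.sum(w, axis=0)
    # Pixels that receive zero total weight (e.g., saturated in every
    # exposure) are returned as NaN rather than guessed.
    return np.where(den > 0, num / np.maximum(den, 1e-12), np.nan)

# Example with a linear response (g[0] is clamped to keep the table finite)
# and the hat weights of (10.7).
levels = np.arange(256)
g_lin = np.log(np.maximum(levels, 1) / 255.0)
hat_w = np.minimum(levels, 255 - levels).astype(float)
E_true = np.linspace(0.05, 1.0, 12).reshape(3, 4)
t = np.array([0.25, 1.0, 4.0])
Z = np.clip(np.round(255.0 * E_true[None] * t[:, None, None]), 0, 255).astype(int)
log_E = merge_log_radiance(Z, g_lin, np.log(t), hat_w)
```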

Under real-world conditions, casually acquired images may not be perfectly registered 
and may contain moving objects. Ward (2003) uses a global (parametric) transform to align 
the input images, while Kang, Uyttendaele, Winder et al. (2003) present an algorithm that 
combines global registration with local motion estimation (optical flow) to accurately align 
the images before blending their radiance estimates (Figure 10.16). Since the images may 
have widely different exposures, care must be taken when estimating the motions, which must 
themselves be checked for consistency to avoid the creation of ghosts and object fragments. 

Figure 10.18 Fuji SuperCCD high dynamic range image sensor. The paired large and small 
active areas provide two different effective exposures. 

Even this approach, however, may not work when the camera is simultaneously undergo- 
ing large panning motions and exposure changes, which is a common occurrence in casually 
acquired panoramas. Under such conditions, different parts of the image may be seen at one 
or more exposures. Devising a method to blend all of these different sources while avoid- 
ing sharp transitions and dealing with scene motion is a challenging problem. One approach 
is to first find a consensus mosaic and to then selectively compute radiances in under- and 
over-exposed regions (Eden, Uyttendaele, and Szeliski 2006), as shown in Figure 10.17. 

Recently, some cameras, such as the Sony α550 and Pentax K-7, have started integrating 
multiple exposure merging and tone mapping directly into the camera body. In the future, 
the need to compute high dynamic range images from multiple exposures may be eliminated 
by advances in camera sensor technology (Figure 10.18) (Yang, El Gamal, Fowler et al. 
1999; Nayar and Mitsunaga 2000; Nayar and Branzoi 2003; Kang, Uyttendaele, Winder et 
al. 2003; Narasimhan and Nayar 2005; Tumblin, Agrawal, and Raskar 2005). However, the 
need to blend such images and to tone map them to lower-gamut displays is likely to remain. 


HDR image formats. Before we discuss techniques for mapping HDR images back to a 
displayable gamut, we should discuss the commonly used formats for storing HDR images. 

If storage space is not an issue, storing each of the R, G, and B values as a 32-bit IEEE 
float is the best solution. The commonly used Portable PixMap (.ppm) format, which supports 
both uncompressed ASCII and raw binary encodings of values, can be extended to a Portable 
FloatMap (.pfm) format by modifying the header. TIFF also supports full floating point 
values. 

Figure 10.19 HDR image encoding formats: (a) Portable PixMap (.ppm); (b) Radiance 
(.pic, .hdr); (c) OpenEXR (.exr). 

A more compact representation is the Radiance format (.pic, .hdr) (Ward 1994), which 
uses a single common exponent and per-channel mantissas (Figure 10.19b). An intermediate 
encoding, OpenEXR from ILM, 12 uses 16-bit floats for each channel (Figure 10.19c), which is a 
format supported natively on most modern GPUs. Ward (2004) describes these and other data 
formats such as LogLuv (Larson 1998) in more detail, as do the books by Reinhard, Ward, 
Pattanaik et al. (2005) and Freeman (2008). An even more recent HDR image format is the 
JPEG XR standard. 13 
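
The shared-exponent idea behind the Radiance encoding can be illustrated with a small sketch. It follows the commonly described RGBE packing (three 8-bit mantissas plus one 8-bit exponent) but has not been checked against the Radiance sources; the function names are hypothetical, negative inputs are not handled, and file headers and run-length coding are ignored.

```python
import math

def float_to_rgbe(r, g, b):
    """Pack linear RGB floats into 8-bit mantissas plus a shared 8-bit exponent."""
    v = max(r, g, b)
    if v < 1e-32:
        return (0, 0, 0, 0)
    m, e = math.frexp(v)           # v = m * 2**e with 0.5 <= m < 1
    scale = m * 256.0 / v          # equals 2**(8 - e)
    return (int(r * scale), int(g * scale), int(b * scale), e + 128)

def rgbe_to_float(R, G, B, E):
    """Unpack an RGBE quadruple back into linear RGB floats."""
    if E == 0:
        return (0.0, 0.0, 0.0)
    f = math.ldexp(1.0, E - (128 + 8))
    return (R * f, G * f, B * f)

packed = float_to_rgbe(1.0, 0.5, 0.25)
unpacked = rgbe_to_float(*packed)
```

Because the exponent is set by the brightest channel, the largest component keeps roughly seven to eight bits of precision, while much darker channels of the same pixel lose precision or are truncated to zero.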

10.2.1 Tone mapping 

Once a radiance map has been computed, it is usually necessary to display it on a lower gamut 
(i.e., eight-bit) screen or printer. A variety of tone mapping techniques has been developed for 
this purpose, which involve either computing spatially varying transfer functions or reducing 
image gradients to fit the available dynamic range (Reinhard, Ward, Pattanaik et al. 2005). 

12 http://www.openexr.net/. 

13 http://www.itu.int/rec/T-REC-T.832-200903-I/en. 

Figure 10.20 Global tone mapping: (a) input HDR image, linearly mapped; (b) gamma 
applied to each color channel independently; (c) gamma applied to intensity (colors are 
less washed out). Original HDR image courtesy of Paul Debevec, http://ict.debevec.org/ 
~debevec/Research/HDR/. Processed images courtesy of Fredo Durand, MIT 6.815/6.865 
course on Computational Photography. 

The simplest way to compress a high dynamic range radiance image into a low dynamic 
range gamut is to use a global transfer curve (Larson, Rushmeier, and Piatko 1997). 
Figure 10.20 shows one such example, where a gamma curve is used to map an HDR image back 
into a displayable gamut. If gamma is applied separately to each channel (Figure 10.20b), the 
colors become muted (less saturated), since higher-valued color channels contribute less 
(proportionately) to the final color. Splitting the image up into its luminance and chrominance 
(say, L*a*b*) components (Section 2.3.2), applying the global mapping to the luminance 
channel, and then reconstituting a color image works better (Figure 10.20c). 
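
The difference between the two global mappings can be seen in a short sketch. The function name `tone_map_gamma` is hypothetical, and, as a simplifying assumption, a weighted sum of the linear RGB channels (with Rec. 709 weights) stands in for the luminance channel instead of a full L*a*b* conversion.

```python
import numpy as np

def tone_map_gamma(rgb, gamma=2.2, on_luminance=True):
    """Global gamma tone mapping of a linear HDR image with non-negative values.

    rgb : (H, W, 3) float array
    """
    rgb = np.asarray(rgb, dtype=float)
    rgb = rgb / rgb.max()                           # map the peak value to 1
    if not on_luminance:
        return rgb ** (1.0 / gamma)                 # per channel (Figure 10.20b)
    lum = rgb @ np.array([0.2126, 0.7152, 0.0722])  # luminance stand-in
    lum = np.maximum(lum, 1e-8)
    ratio = rgb / lum[..., None]                    # chrominance as color ratios
    out = ratio * (lum ** (1.0 / gamma))[..., None] # gamma on luminance only
    return np.clip(out, 0.0, 1.0)

# One white pixel and one dark, saturated pixel whose R:G ratio is 4:1.
hdr = np.array([[[1.0, 1.0, 1.0], [0.08, 0.02, 0.01]]])
on_lum = tone_map_gamma(hdr, on_luminance=True)
per_channel = tone_map_gamma(hdr, on_luminance=False)
```

Mapping the luminance preserves the ratios between color channels (unless a channel clips), whereas the per-channel mapping pulls those ratios toward 1, which is why the colors look washed out.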

Unfortunately, when the image has a really wide range of exposures, this global approach 
still fails to preserve details in regions with widely varying exposures. What is needed, in- 
stead, is something akin to the dodging and burning performed by photographers in the dark- 
room. Mathematically, this is similar to dividing each pixel by the average brightness in a 
region around that pixel. 

Figure 10.21 shows how this process works. As before, the image is split into its 
luminance and chrominance channels. The log luminance image 

    H(x, y) = \log L(x, y)    (10.12) 

is then low-pass filtered to produce a base layer 

    H_L(x, y) = B(x, y) * H(x, y),    (10.13) 

and a high-pass detail layer 

    H_H(x, y) = H(x, y) - H_L(x, y).    (10.14) 

The base layer is then contrast reduced by scaling to the desired log-luminance range, 

    H'_L(x, y) = s H_L(x, y),    (10.15) 

and added to the detail layer to produce the new log-luminance image 

    I(x, y) = H'_L(x, y) + H_H(x, y),    (10.16) 

which can then be exponentiated to produce the tone-mapped (compressed) luminance 
image. Note that this process is equivalent to dividing each luminance value by (a monotonic 
mapping of) the average log-luminance value in a region around that pixel. 
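
The base/detail decomposition with a linear low-pass filter can be sketched as follows. The helper names, the choice of a Gaussian for the low-pass filter B, and the way the scale s is derived from a target log-luminance range are assumptions of this example.

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian low-pass filter with edge replication."""
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    k /= k.sum()
    pad = np.pad(img, r, mode="edge")
    tmp = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, pad)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, tmp)

def local_tone_map(lum, sigma=4.0, target_range=np.log(50.0)):
    """Tone map a positive luminance image by compressing only its base layer."""
    H = np.log(lum)                               # log luminance
    base = gaussian_blur(H, sigma)                # low-pass base layer
    detail = H - base                             # high-pass detail layer
    s = min(1.0, target_range / max(base.max() - base.min(), 1e-12))
    I = s * base + detail                         # compressed base plus detail
    return np.exp(I - I.max())                    # brightest value mapped to 1

# A step edge spanning four orders of magnitude.
lum = np.ones((8, 64))
lum[:, 32:] = 1e4
out = local_tone_map(lum, sigma=2.0)
```

Away from the edge, the 10,000:1 contrast is reduced to the requested 50:1. Right next to the edge, however, the detail layer overshoots, so the output there is brighter than the rest of the bright region; this is the halo artifact.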

Figure 10.21 shows the low-pass and high-pass log luminance image and the resulting 
tone-mapped color image. Note how the detail layer has visible halos around the high- 
contrast edges, which are visible in the final tone-mapped image. This is because linear 
filtering, which is not edge preserving, produces halos in the detail layer (Figure 10.23). 

The solution to this problem is to use an edge-preserving filter to create the base layer. Du- 
rand and Dorsey (2002) study a number of such edge-preserving filters, including anisotropic 
and robust anisotropic diffusion, and select bilateral filtering (Section 3.3.1) as their 
edge-preserving filter. (A more recent paper by Farbman, Fattal, Lischinski et al. (2008) argues 
in favor of using a weighted least squares (WLS) filter as an alternative to the bilateral filter, 
and Paris, Kornprobst, Tumblin et al. (2008) review bilateral filtering and its applications 
in computer vision and computational photography.) Figure 10.22 shows how replacing the 
linear low-pass filter with a bilateral filter produces tone-mapped images with no visible ha- 
los. Figure 10.24 summarizes the complete information flow in this process, starting with 
the decomposition into log luminance and chrominance images, bilateral filtering, contrast 
reduction, and re-composition into the final output image. 
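
A brute-force bilateral filter is enough to see why the halos disappear. The sketch below is not an efficient implementation (fast approximations exist), and the function name and default parameter values are assumptions of this example; it is meant to be applied to the log-luminance image to produce the base layer.

```python
import numpy as np

def bilateral_filter(img, sigma_s=2.0, sigma_r=0.4):
    """Brute-force bilateral filter of a 2D float image (edge-replicated borders).

    sigma_s : spatial (domain) standard deviation in pixels
    sigma_r : range standard deviation in the units of img
    """
    r = int(3 * sigma_s)
    h, w = img.shape
    pad = np.pad(img, r, mode="edge")
    num = np.zeros((h, w))
    den = np.zeros((h, w))
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            shifted = pad[r + dy:r + dy + h, r + dx:r + dx + w]
            domain_w = np.exp(-(dx * dx + dy * dy) / (2.0 * sigma_s**2))
            range_w = np.exp(-((shifted - img) ** 2) / (2.0 * sigma_r**2))
            num += domain_w * range_w * shifted
            den += domain_w * range_w
    return num / den

# A strong step edge in log luminance is left intact by the base layer...
H = np.zeros((8, 16))
H[:, 8:] = np.log(1e4)
base = bilateral_filter(H)

# ...while low-amplitude variations are still smoothed.
rng = np.random.default_rng(0)
noisy = 1.0 + 0.05 * rng.standard_normal((16, 16))
smoothed = bilateral_filter(noisy)
```

Since the base layer follows the step exactly, the detail layer H - base contains no overshoot at the edge, so compressing the base layer does not produce halos.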

An alternative to compressing the base layer is to compress its derivatives, i.e., the 
gradient of the log-luminance image (Fattal, Lischinski, and Werman 2002). Figure 10.25 
illustrates this process. The log-luminance image is differentiated to obtain a gradient image 

    H'(x, y) = \nabla H(x, y).    (10.17) 

This gradient image is then attenuated by a spatially varying attenuation function \Phi(x, y), 

    G(x, y) = H'(x, y) \Phi(x, y).    (10.18) 

The attenuation function \Phi(x, y) is designed to attenuate large-scale brightness changes 
(Figure 10.26a) and is designed to take into account gradients at different spatial scales (Fattal, 
Lischinski, and Werman 2002). 

After attenuation, the resulting gradient field is re-integrated by solving a first-order 
variational (least squares) problem, 

    \min \int\int \| \nabla I(x, y) - G(x, y) \|^2 dx dy    (10.19) 


Figure 10.21 Local tone mapping using linear filters: (a) low-pass and high-pass filtered log 
luminance images and color (chrominance) image; (b) resulting tone-mapped image (after at- 
tenuating the low-pass log luminance image) shows visible halos around the trees. Processed 
images courtesy of Fredo Durand, MIT 6.815/6.865 course on Computational Photography. 





Figure 10.22 Local tone mapping using bilateral filter (Durand and Dorsey 2002): (a) low- 
pass and high-pass bilateral filtered log luminance images and color (chrominance) image; 
(b) resulting tone-mapped image (after attenuating the low-pass log luminance image) shows 
no halos. Processed images courtesy of Fredo Durand, MIT 6.815/6.865 course on 
Computational Photography. 

Figure 10.23 Gaussian vs. bilateral filtering (Petschnigg, Agrawala, Hoppe et al. 2004) © 
2004 ACM: A Gaussian low-pass filter blurs across all edges and therefore creates strong 
peaks and valleys in the detail image that cause halos. The bilateral filter does not smooth 
across strong edges and thereby reduces halos while still capturing detail. 




Figure 10.24 Local tone mapping using bilateral filter (Durand and Dorsey 2002): sum- 
mary of algorithm workflow. Images courtesy of Fredo Durand, MIT 6.815/6.865 course on 
Computational Photography. 



492 


Computer Vision: Algorithms and Applications (September 3, 2010 draft) 







Figure 10.25 Gradient domain tone mapping (Fattal, Lischinski, and Werman 2002) © 
2002 ACM. The original image with a dynamic range of 2415:1 is first converted into the log 
domain, H(x), and its gradients are computed, H'(x). These are attenuated (compressed) 
based on local contrast, G(x), and integrated to produce the new logarithmic exposure image 
I(x), which is exponentiated to produce the final intensity image, whose dynamic range is 
7.5:1. 


to obtain the compressed log-luminance image I(x, y). This least squares problem is the 
same one that was used for Poisson blending (Section 9.3.4) and was first introduced in our study 
of regularization (Section 3.7.1, 3.100). It can efficiently be solved using techniques such 
as multigrid and hierarchical basis preconditioning (Fattal, Lischinski, and Werman 2002; 
Szeliski 2006b; Farbman, Fattal, Lischinski et al. 2008). Once the new luminance image has 
been computed, it is combined with the original color image using 

    C_{out} = \left( \frac{C_{in}}{L_{in}} \right)^s L_{out},    (10.20) 

where C = (R, G, B) and L_{in} and L_{out} are the original and compressed luminance images. 
The exponent s controls the saturation of the colors and is typically in the range s \in [0.4, 0.6]. 
Figure 10.26b shows the final tone-mapped color image, which shows no visible halos despite 
the extremely large variation in input radiance values. 
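
The gradient-domain idea is easiest to see on a single scanline, where the least squares re-integration reduces to a running sum. The sketch below makes several assumptions: the function names are hypothetical, the attenuation is a single-scale power law of the form (|H'| / alpha)^(beta - 1) rather than the multi-scale function described above, and the parameter values are illustrative. The second function applies the color recombination (10.20).

```python
import numpy as np

def compress_gradients_1d(H, alpha=0.1, beta=0.85):
    """Attenuate large log-luminance gradients on a scanline and re-integrate."""
    dH = np.diff(H)                            # gradient H'(x)
    mag = np.maximum(np.abs(dH), 1e-6)         # avoid 0 ** negative power
    phi = (mag / alpha) ** (beta - 1.0)        # < 1 for gradients above alpha
    G = dH * phi                               # attenuated gradient
    # In 1D, the least squares integration of G is a cumulative sum.
    return np.concatenate([[H[0]], H[0] + np.cumsum(G)])

def recombine_color(C_in, L_in, L_out, s=0.5):
    """Apply (10.20): C_out = (C_in / L_in)**s * L_out for each channel."""
    return (C_in / L_in[..., None]) ** s * L_out[..., None]

H = np.array([0.0, 0.0, 0.0, 5.0, 5.0, 5.0])   # log-luminance step
I = compress_gradients_1d(H)

gray = recombine_color(np.full((1, 1, 3), 2.0),
                       np.array([[2.0]]), np.array([[0.5]]))
```

In 2D the attenuated gradient field is generally not integrable, which is why the full method solves the Poisson-like problem (10.19) instead of summing along a path.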

Yet another alternative to these two approaches is to perform the local dodging and 
burning using a locally scale-selective operator (Reinhard, Stark, Shirley et al. 2002). Figure 10.27 
shows how such a scale selection operator can determine a radius (scale) that only includes 
similar color values within the inner circle while avoiding much brighter values in the 
surrounding circle. In practice, a difference of Gaussians normalized by the inner Gaussian 
response is evaluated over a range of scales, and the largest scale whose metric is below a 
threshold is selected (Reinhard, Stark, Shirley et al. 2002). 

Figure 10.26 Gradient domain tone mapping (Fattal, Lischinski, and Werman 2002) © 
2002 ACM: (a) attenuation map, with darker values corresponding to more attenuation; (b) 
final tone-mapped image. 

What all of these techniques have in common is that they adaptively attenuate or brighten 
different regions of the image so that they can be displayed in a limited gamut without loss of 
contrast. Lischinski, Farbman, Uyttendaele et al. (2006b) introduce an interactive technique 
that performs this operation by interpolating a set of sparse user-drawn adjustments (strokes 
and associated exposure value corrections) to a piecewise-continuous exposure correction 
map (Figure 10.28). The interpolation is performed by minimizing a locally weighted least 
squares (WLS) variational problem, 

    \min \int\int w_d(x, y) \| f(x, y) - g(x, y) \|^2 dx dy + \lambda \int\int w_s(x, y) \| \nabla f(x, y) \|^2 dx dy,    (10.21) 

where g(x, y) and f(x, y) are the input and output log exposure (attenuation) maps 
(Figure 10.28). The data weighting term w_d(x, y) is 1 at stroke locations and 0 elsewhere. The 
smoothness weighting term w_s(x, y) is inversely proportional to the log-luminance gradient, 

    w_s = \frac{1}{\| \nabla H \|^{\alpha} + \epsilon},    (10.22) 

and hence encourages the f(x, y) map to be smoother in low-gradient areas than along 
high-gradient discontinuities. 14 The same approach can also be used for fully automated tone 
mapping by setting target exposure values at each pixel and allowing the weighted least squares 
to convert these into piecewise smooth adjustment maps. 

14 In practice, the x and y discrete derivatives are weighted separately (Lischinski, Farbman, Uyttendaele et al. 

Figure 10.27 Scale selection for tone mapping (Reinhard, Stark, Shirley et al. 2002) © 
2002 ACM. 

The weighted least squares algorithm, which was originally developed for image coloriza- 
tion applications (Levin, Lischinski, and Weiss 2004), has recently been applied to general 
edge-preserving smoothing in applications such as contrast enhancement (Bae, Paris, and 
Durand 2006) and tone mapping (Farbman, Fattal, Lischinski et al. 2008), where bilateral 
filtering was previously used. It can also be used to perform HDR merging and tone mapping 
simultaneously (Raman and Chaudhuri 2007, 2009). 
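
A one-dimensional discrete analogue of the weighted least squares interpolation (10.21) and (10.22) shows how sparse strokes spread out into a piecewise-smooth map. The function name `wls_interpolate_1d`, the dense linear solver, and the test signal are assumptions of this sketch; a practical 2D implementation would use a sparse solver.

```python
import numpy as np

def wls_interpolate_1d(g, w_d, H, lam=0.2, alpha=1.0, eps=1e-4):
    """Minimize sum w_d (f - g)^2 + lam * sum w_s (f[i+1] - f[i])^2 over f.

    g   : target log-exposure values (used only where w_d > 0)
    w_d : data weights, 1 at stroke locations and 0 elsewhere
    H   : log-luminance signal that steers the smoothness weights
    """
    g = np.asarray(g, dtype=float)
    w_d = np.asarray(w_d, dtype=float)
    n = len(g)
    # Smoothness weights: small across strong log-luminance edges.
    w_s = 1.0 / (np.abs(np.diff(H)) ** alpha + eps)
    A = np.diag(w_d)
    for i in range(n - 1):                     # assemble the normal equations
        A[i, i] += lam * w_s[i]
        A[i + 1, i + 1] += lam * w_s[i]
        A[i, i + 1] -= lam * w_s[i]
        A[i + 1, i] -= lam * w_s[i]
    return np.linalg.solve(A, w_d * g)

# Two "strokes": brighten the dark region (+1), darken the bright region (-1).
H = np.concatenate([np.zeros(10), 5.0 * np.ones(10)])
g = np.zeros(20)
w_d = np.zeros(20)
g[2], w_d[2] = 1.0, 1.0
g[17], w_d[17] = -1.0, 1.0
f = wls_interpolate_1d(g, w_d, H)
```

The resulting map is nearly constant within each region and changes abruptly at the luminance edge, which is the behavior needed to avoid halos.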

Given the wide range of locally adaptive tone mapping algorithms that have been devel- 
oped, which ones should be used in practice? Freeman (2008) provides a great discussion 
of commercially available algorithms, their artifacts, and the parameters that can be used to 
control them. He also has a wealth of tips for HDR photography and workflow. I highly rec- 
ommend his book for anyone contemplating additional research (or personal photography) in 
this area. 


10.2.2 Application: Flash photography 

While high dynamic range imaging combines images of a scene taken at different exposures, 
it is also possible to combine flash and non-flash images to achieve better exposure and color 
balance and to reduce noise (Eisemann and Durand 2004; Petschnigg, Agrawala, Hoppe et 
al. 2004). 

The problem with flash images is that the color is often unnatural (it fails to capture the ambient illumination), there may be strong shadows or specularities, and there is a radial falloff in brightness away from the camera (Figures 10.1b and 10.29a). Non-flash photos taken under low light conditions often suffer from excessive noise (because of the high ISO gains and low photon counts) and blur (due to longer exposures). Is there some way to combine a non-flash photo taken just before the flash goes off with the flash photo to produce an image with good color values, sharpness, and low noise? 15

2006b). Their default parameter settings are $\lambda = 0.2$, $\alpha = 1$, and $\epsilon = 0.0001$.

Figure 10.28 Interactive local tone mapping (Lischinski, Farbman, Uyttendaele et al. 2006b) © 2006 ACM: (a) user-drawn strokes with associated exposure values g(x, y); (b) corresponding piecewise-smooth exposure adjustment map f(x, y).

Figure 10.29 Detail transfer in flash/no-flash photography (Petschnigg, Agrawala, Hoppe et al. 2004) © 2004 ACM: (a) details of input ambient A and flash F images; (b) joint bilaterally filtered no-flash image $A^{NR}$; (c) detail layer $F^{Detail}$ computed from the flash image F; (d) final merged image $A^{Final}$.

Petschnigg, Agrawala, Hoppe et al. (2004) approach this problem by first filtering the no-flash (ambient) image A with a variant of the bilateral filter called the joint bilateral filter 16 in which the range kernel (3.36)

$$r(i, j, k, l) = \exp\left(-\frac{\|f(i, j) - f(k, l)\|^2}{2\sigma_r^2}\right) \tag{10.23}$$

is evaluated on the flash image F instead of the ambient image A, since the flash image is less noisy and hence has more reliable edges (Figure 10.29b). Because the contents of the flash image can be unreliable inside and at the boundaries of shadows and specularities, these are detected and a regular bilaterally filtered image $A^{Base}$ is used instead (Figure 10.30).

The second stage of their algorithm computes a flash detail image

$$F^{Detail} = \frac{F + \epsilon}{F^{Base} + \epsilon}, \tag{10.24}$$

where $F^{Base}$ is a bilaterally filtered version of the flash image F and $\epsilon = 0.02$. This detail image (Figure 10.29c) encodes details that may have been filtered away from the noise-reduced no-flash image $A^{NR}$, as well as additional details created by the flash camera, which often add crispness. The detail image is used to modulate the noise-reduced ambient image $A^{NR}$ to produce the final results

$$A^{Final} = (1 - M) A^{NR} F^{Detail} + M A^{Base} \tag{10.25}$$

shown in Figures 10.1b and 10.29d.
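The two stages above can be sketched on 1D signals as follows. This is a simplified illustration rather than the published algorithm: the function names, the 1D setting, the kernel widths, and the hand-supplied mask are all assumptions, and the shadow/specularity detection step is omitted entirely. Only the value 0.02 for the small constant in the detail ratio is taken from the text.

```python
import math


def joint_bilateral_1d(signal, guide, radius=2, sigma_d=2.0, sigma_r=0.1):
    """Smooth `signal`, taking the range weights from `guide`.

    Passing guide=signal gives a regular bilateral filter.
    """
    n = len(signal)
    out = []
    for i in range(n):
        num = den = 0.0
        for k in range(max(0, i - radius), min(n, i + radius + 1)):
            d = math.exp(-((i - k) ** 2) / (2 * sigma_d ** 2))  # domain kernel
            r = math.exp(-((guide[i] - guide[k]) ** 2) / (2 * sigma_r ** 2))  # range kernel
            num += d * r * signal[k]
            den += d * r
        out.append(num / den)  # den >= 1 because k == i always contributes
    return out


def flash_no_flash_1d(ambient, flash, mask, eps=0.02, **kw):
    a_nr = joint_bilateral_1d(ambient, flash, **kw)      # noise-reduced ambient
    a_base = joint_bilateral_1d(ambient, ambient, **kw)  # regular bilateral fallback
    f_base = joint_bilateral_1d(flash, flash, **kw)      # flash base layer
    f_detail = [(f + eps) / (fb + eps) for f, fb in zip(flash, f_base)]
    return [(1 - m) * anr * fd + m * ab
            for m, anr, fd, ab in zip(mask, a_nr, f_detail, a_base)]


# Noisy ambient step edge, clean flash step edge, no masked pixels.
ambient = [0.2, 0.3, 0.1, 0.2, 0.8, 0.7, 0.9, 0.8]
flash = [0.5] * 4 + [1.0] * 4
result = flash_no_flash_1d(ambient, flash, mask=[0.0] * 8)
```

Because the range weights come from the clean flash signal, the ambient noise is averaged away on each side of the edge without blurring across it.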

Eisemann and Durand (2004) present an alternative algorithm that shares some of the 
same basic concepts. Both papers are well worth reading and contrasting (Exercise 10.6). 

Flash images can also be used for a variety of additional applications such as extracting 
more reliable foreground mattes of objects (Raskar, Tan, Feris et al. 2004; Sun, Li, Kang et al. 
2006). Flash photography is just one instance of the more general topic of active illumination, 
which is discussed in more detail by Raskar and Tumblin (2010). 


15 In fact, the discontinued FujiFilm FinePix F40fd camera takes a pair of flash and no-flash images in quick succession; however, it only lets you decide to keep one of them.

16 Eisemann and Durand (2004) call this the cross bilateral filter.

Figure 10.30 Flash/no-flash photography algorithm (Petschnigg, Agrawala, Hoppe et al. 2004) © 2004 ACM. The ambient (no-flash) image A is filtered with a regular bilateral filter to produce $A^{Base}$, which is used in shadow and specularity regions, and with a joint bilateral filter to produce the noise-reduced image $A^{NR}$. The flash image F is bilaterally filtered to produce a base image $F^{Base}$ and a detail (ratio) image $F^{Detail}$, which is used to modulate the denoised ambient image. The shadow/specularity mask M is computed by comparing linearized versions of the flash and no-flash images.


10.3 Super-resolution and blur removal 

While high dynamic range imaging enables us to obtain an image with a larger dynamic range than a single regular image, super-resolution enables us to create images with higher spatial resolution and less noise than regular camera images (Chaudhuri 2001; Park, Park, and Kang 2003; Capel and Zisserman 2003; Capel 2004; van Ouwerkerk 2006). Most commonly, super-resolution refers to the process of aligning and combining several input images to produce such high-resolution composites (Irani and Peleg 1991; Cheeseman, Kanefsky, Hanson et al. 1993; Pickup, Capel, Roberts et al. 2009). However, some newer techniques can super-resolve a single image (Freeman, Jones, and Pasztor 2002; Baker and Kanade 2002; Fattal 2007) and are hence closely related to techniques for removing blur (Sections 3.4.3 and 3.4.4).

The most principled way to formulate the super-resolution problem is to write down the stochastic image formation equations and image priors and to then use Bayesian inference to recover the super-resolved (original) sharp image. We can do this by generalizing the image formation equations (3.75) used for image deblurring (Section 3.4.3), which we also used in (10.2) for blur kernel (PSF) estimation (Section 10.1.4). In this case, we have several observed images $\{o_k(x)\}$, as well as an image warping function $h_k(x)$ for each observed image (Figure 3.47). Combining all of these elements, we get the (noisy) observation equations 17

$$o_k(x) = D\{b(x) * s(h_k(x))\} + n_k(x), \tag{10.26}$$

where D is the downsampling operator, which operates after the super-resolved (sharp) warped image $s(h_k(x))$ has been convolved with the blur kernel $b(x)$. The above image formation equations lead to the following least squares problem,

$$\sum_k \|o_k(x) - D\{b_k(x) * s(h_k(x))\}\|^2. \tag{10.27}$$

In most super-resolution algorithms, the alignment (warping) $h_k$ is estimated using one of the input frames as the reference frame; either feature-based (Section 6.1.3) or direct (image-based) (Section 8.2) parametric alignment techniques can be used. (A few algorithms, such as those described by Schultz and Stevenson (1996) or Capel (2004), use dense (per-pixel flow) estimates.) A better approach is to re-compute the alignment by directly minimizing (10.27) once an initial estimate of $s(x)$ has been computed (Hardie, Barnard, and Armstrong 1997) or to marginalize out the motion parameters altogether (Pickup, Capel, Roberts et al. 2007); see also the work of Protter and Elad (2009) for some related video super-resolution work.

The point spread function (blur kernel) $b_k$ is either inferred from knowledge of the image formation process (e.g., the amount of motion or defocus blur and the camera sensor optics) or calibrated from a test image or the observed images $\{o_k\}$ using one of the techniques described in Section 10.1.4. The problem of simultaneously inferring the blur kernel and the sharp image is known as blind image deconvolution (Kundur and Hatzinakos 1996; Levin 2006). 18

Given an estimate of $h_k$ and $b_k(x)$, (10.27) can be re-written using matrix/vector notation as a large sparse least squares problem in the unknown values of the super-resolved pixels s,

$$\sum_k \|o_k - D B_k W_k s\|^2. \tag{10.28}$$

17 It is also possible to add an unknown bias-gain term to each observation (Capel 2004), as was done for motion estimation in (8.8).

18 Notice that there is a chicken-and-egg problem if both the blur kernel and the super-resolved image are unknown. This can be “broken” either using structural assumptions about the sharp image, e.g., the presence of edges (Joshi, Szeliski, and Kriegman 2008) or prior models for the image, such as edge sparsity (Fergus, Singh, Hertzmann et al. 2006).

(Recall from (3.89) that once the warping function $h_k$ is known, values of $s(h_k(x))$ depend linearly on those in $s(x)$.) An efficient way to solve this least squares problem is to use preconditioned conjugate gradient descent (Capel 2004), although some earlier algorithms, such as the one developed by Irani and Peleg (1991), used regular gradient descent (also known as iterative back projection (IBP), in the computed tomography literature).
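To make the regular gradient descent option concrete, here is a toy 1D sketch of minimizing a least squares objective of the form (10.28). Everything about the setup is an assumption chosen to keep the example short: the warps are known circular integer shifts, the blur is a box filter as wide as the decimation factor, and the step size and iteration count are hand-picked. It is not a reproduction of any of the cited algorithms.

```python
def forward(s, shift, factor):
    """Apply shift, box blur of width `factor`, and decimation to s."""
    n = len(s)
    return [sum(s[(j * factor + shift + t) % n] for t in range(factor)) / factor
            for j in range(n // factor)]


def adjoint(r, shift, factor, n):
    """Transpose of `forward`: spread each low-res value over the pixels it averaged."""
    out = [0.0] * n
    for j, rj in enumerate(r):
        for t in range(factor):
            out[(j * factor + shift + t) % n] += rj / factor
    return out


def super_resolve(observations, shifts, factor, n, iters=500, step=1.0):
    """Gradient descent on sum_k ||o_k - A_k s||^2 (constant factors folded into step)."""
    s = [0.0] * n
    for _ in range(iters):
        update = [0.0] * n
        for o, shift in zip(observations, shifts):
            pred = forward(s, shift, factor)
            resid = [oi - pi for oi, pi in zip(o, pred)]
            # Back-project the low-resolution residual into the high-res grid.
            for i, v in enumerate(adjoint(resid, shift, factor, n)):
                update[i] += v
        s = [si + step * ui for si, ui in zip(s, update)]
    return s


truth = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
shifts = [0, 1]
observations = [forward(truth, sh, 2) for sh in shifts]
estimate = super_resolve(observations, shifts, factor=2, n=len(truth))
```

In this toy model the estimate reproduces both observations, yet it need not equal `truth`: an alternating (+1, -1, ...) pattern is averaged to zero by the width-2 box blur at every shift, so that component of the signal is unobservable from the data alone.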

The above formulation assumes that warping can be expressed as a simple (sinc or bicubic) interpolated resampling of the super-resolved sharp image, followed by a stationary (spatially invariant) blurring (PSF) and area integration process. However, if the surface is severely foreshortened, we have to take into account the spatially varying filtering that occurs during the image warping (Section 3.6.1), before we can then model the PSF induced by the optics and camera sensor (Wang, Kang, Szeliski et al. 2001; Capel 2004).

How well does this least squares (MLE) approach to super-resolution work? In practice, 
this depends a lot on the amount of blur and aliasing in the camera optics, as well as the accu- 
racy in the motion and PSF estimates (Baker and Kanade 2002; Jiang, Wong, and Bao 2003; 
Capel 2004). Less blurring and more aliasing means that there is more (aliased) high fre- 
quency information available to be recovered. However, because the least squares (maximum 
likelihood) formulation uses no image prior, a lot of high-frequency noise can be introduced 
into the solution (Figure 10.31c). 

For this reason, most super-resolution algorithms assume some form of image prior. The simplest of these is to place a penalty on the image derivatives similar to Equations (3.105 and 3.113), e.g.,

$$\sum_{(i,j)} \rho_p(s(i,j) - s(i+1,j)) + \rho_p(s(i,j) - s(i,j+1)). \tag{10.29}$$


As discussed in Section 3.7.2, when $\rho_p$ is quadratic, this is a form of Tikhonov regularization (Section 3.7.1), and the overall problem is still linear least squares. The resulting prior image model is a Gaussian Markov random field (GMRF), which can be extended to other (e.g., diagonal) differences, as in (Capel 2004) (Figure 10.31).

Unfortunately, GMRFs tend to produce solutions with visible ripples, which can also be interpreted as increased noise sensitivity in middle frequencies (Exercise 3.17). A better image prior is a robust prior that encourages piecewise continuous solutions (Black and Rangarajan 1996); see Appendix B.3. Examples of such priors include the Huber potential (Schultz and Stevenson 1996; Capel and Zisserman 2003), which is a blend of a Gaussian with a longer-tailed Laplacian, and the even sparser (heavier-tailed) hyper-Laplacians used by Levin, Fergus, Durand et al. (2007) and Krishnan and Fergus (2009). It is also possible to learn the parameters for such priors using cross-validation (Capel 2004; Pickup 2007).
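The difference between these penalties is easiest to see by evaluating them on a small and a large derivative value. The sketch below uses one common parameterization of the Huber potential (quadratic inside a threshold, linear outside) and a plain power law for the hyper-Laplacian. The function names, the threshold 0.04, and the exponent 0.8 are illustrative assumptions, not values prescribed by the cited papers.

```python
def quadratic(x):
    """Gaussian (GMRF) penalty: grows quadratically, so large edges are costly."""
    return x * x


def huber(x, alpha=0.04):
    """Quadratic for |x| <= alpha, linear beyond; continuous at |x| == alpha."""
    a = abs(x)
    return x * x if a <= alpha else 2 * alpha * a - alpha * alpha


def hyper_laplacian(x, p=0.8):
    """|x|^p with p < 1: an even heavier-tailed (sparser) penalty."""
    return abs(x) ** p


for step in (0.01, 1.0):
    print(step, quadratic(step), huber(step), hyper_laplacian(step))
```

For a small derivative the Huber and quadratic penalties coincide, while for a large step edge the Huber penalty grows only linearly, so a single sharp edge is penalized far less than under the quadratic prior.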

While sparse (robust) derivative priors can reduce rippling effects and increase edge sharpness, they cannot hallucinate higher-frequency texture or details. To do this, a training set of sample images can be used to find plausible mappings between low-frequency originals and the missing higher frequencies. Inspired by some of the example-based texture synthesis algorithms we discuss in Section 10.5, the example-based super-resolution algorithm developed by Freeman, Jones, and Pasztor (2002) uses training images to learn the mapping between local texture patches and missing higher-frequency details. To ensure that overlapping patches are similar in appearance, a Markov random field is used and optimized using either belief propagation (Freeman, Pasztor, and Carmichael 2000) or a raster-scan deterministic variant (Freeman, Jones, and Pasztor 2002). Figure 10.32 shows the results of hallucinating missing details using this approach and compares these results to a more recent algorithm by Fattal (2007). This latter algorithm learns to predict oriented gradient magnitudes in the finer resolution image based on a pixel’s location relative to the nearest detected edge along with the corresponding edge statistics (magnitude and width). It is also possible to combine sparse (robust) derivative priors with example-based super-resolution, as shown by Tappen, Russell, and Freeman (2003).

Figure 10.31 Super-resolution results using a variety of image priors (Capel 2001): (a) Low-res ROI (bicubic 3× zoom); (b) average image; (c) MLE @ 1.25× pixel-zoom; (d) simple $\|x\|^2$ prior ($\lambda = 0.004$); (e) GMRF ($\lambda = 0.003$); (f) HMRF ($\lambda = 0.01$, $\alpha = 0.04$). 10 images are used as input and a 3× super-resolved image is produced in each case, except for the MLE result in (c).

Figure 10.32 Example-based super-resolution: (a) original 32 × 32 low-resolution image; (b) example-based super-resolved 256 × 256 image (Freeman, Jones, and Pasztor 2002) © 2002 IEEE; (c) upsampling via imposed edge statistics (Fattal 2007) © 2007 ACM.

An alternative (but closely related) form of hallucination is to recognize the parts of a training database of images to which a low-resolution pixel might correspond. In their work, Baker and Kanade (2002) use local derivative-of-Gaussian filter responses as features and then match parent structure vectors in a manner similar to De Bonet (1997). 19 The high-frequency gradient at each recognized training image location is then used as a constraint on the super-resolved image, along with the usual reconstruction (prediction) equation (10.27). Figure 10.33 shows the result of hallucinating higher-resolution faces from lower-resolution inputs; Baker and Kanade (2002) also show examples of super-resolving known-font text. Exercise 10.7 gives more details on how to implement and test one or more of these super-resolution techniques.

Under favorable conditions, super-resolution and related upsampling techniques can increase the resolution of a well-photographed image or image collection. When the input images are blurry to start with, the best one can often hope for is to reduce the amount of blur. This problem is closely related to super-resolution, with the biggest differences being that the blur kernel b is usually much larger and the downsampling factor D is unity. A large literature on image deblurring exists; some of the more recent publications with nice literature reviews include those by Fergus, Singh, Hertzmann et al. (2006), Yuan, Sun, Quan et al. (2008), and Joshi, Zitnick, Szeliski et al. (2009). It is also possible to reduce blur by combining sharp (but noisy) images with blurrier (but cleaner) images (Yuan, Sun, Quan et al. 2007), taking lots of quick exposures 20 (Hasinoff and Kutulakos 2008; Hasinoff, Kutulakos, Durand et al. 2009; Hasinoff, Durand, and Freeman 2010), or using coded aperture techniques to simultaneously estimate depth and reduce blur (Levin, Fergus, Durand et al. 2007; Zhou, Lin, and Nayar 2009).

19 For face super-resolution, where all the images are pre-aligned, only corresponding pixels in different images are examined.

20 The SONY DSC-WX1 takes multiple shots to produce better low-light photos.

Figure 10.33 Recognition-based super-resolution (Baker and Kanade 2002) © 2002 IEEE. The Hallucinated column shows the results of the recognition-based algorithm compared to the regularization-based approach of Hardie, Barnard, and Armstrong (1997).

10.3.1 Color image demosaicing 

A special case of super-resolution, which is used daily in most digital still cameras, is the process of demosaicing samples from a color filter array (CFA) into a full-color RGB image. Figure 10.34 shows the most commonly used CFA known as the Bayer pattern, which has twice as many green (G) sensors as red and blue sensors.

The process of going from the known CFA pixel values to the full RGB image is quite challenging. Unlike regular super-resolution, where small errors in guessing unknown values usually show up as blur or aliasing, demosaicing artifacts often produce spurious colors or high-frequency patterned zippering, which are quite visible to the eye (Figure 10.35b).
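The sampling structure of a Bayer mosaic, and the simplest (bilinear) way of filling in the missing green values, can be sketched as follows. The particular layout (green on the diagonal, red on even rows, blue on odd rows) and all function names are assumptions for illustration; real sensors use one of several equivalent phase shifts of this pattern, and production demosaicing is considerably more elaborate.

```python
def bayer_color(row, col):
    """Filter color at (row, col) for an assumed G R / B G layout."""
    if (row + col) % 2 == 0:
        return "G"
    return "R" if row % 2 == 0 else "B"


def mosaic(rgb):
    """Keep only the channel that the CFA measures at each pixel."""
    idx = {"R": 0, "G": 1, "B": 2}
    return [[px[idx[bayer_color(r, c)]] for c, px in enumerate(row)]
            for r, row in enumerate(rgb)]


def interpolate_green(raw):
    """Bilinear green: at red/blue sites, average the available 4-neighbors."""
    h, w = len(raw), len(raw[0])
    green = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if bayer_color(r, c) == "G":
                green[r][c] = raw[r][c]
            else:
                # All 4-neighbors of a red or blue site are green sites.
                nbrs = [raw[rr][cc]
                        for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                        if 0 <= rr < h and 0 <= cc < w]
                green[r][c] = sum(nbrs) / len(nbrs)
    return green


colors = [bayer_color(r, c) for r in range(4) for c in range(4)]
flat = [[(0.2, 0.5, 0.9)] * 4 for _ in range(4)]
green = interpolate_green(mosaic(flat))
```

On a constant image this interpolation is exact. Near an edge, however, the plain average mixes values from both sides, which is one source of the fringing and zippering artifacts mentioned above.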

Over the years, a variety of techniques have been developed for image demosaicing (Kimmel 1999). Bennett, Uyttendaele, Zitnick et al. (2006) present a recently developed algorithm along with some good references, while Longere, Delahunt, Zhang et al. (2002) and Tappen, Russell, and Freeman (2003) compare some previously developed techniques using perceptually motivated metrics. To reduce the zippering effect, most techniques use the edge or gradient information from the green channel, which is more reliable because it is sampled more densely, to infer plausible values for the red and blue channels, which are more sparsely sampled.

Figure 10.34 Bayer RGB pattern: (a) color filter array layout; (b) interpolated pixel values, with unknown (guessed) values shown as lower case.

Figure 10.35 CFA demosaicing results (Bennett, Uyttendaele, Zitnick et al. 2006) © 2006 Springer: (a) original full-resolution image (a color subsampled version is used as the input to the algorithms); (b) bilinear interpolation results, showing color fringing near the tip of the blue crayon and zippering near its left (vertical) edge; (c) the high-quality linear interpolation results of Malvar, He, and Cutler (2004) (note the strong halo/checkerboard artifacts on the yellow crayon); (d) using the local two-color prior of Bennett, Uyttendaele, Zitnick et al. (2006).

Figure 10.36 Two-color model computed from a collection of local 5 × 5 neighborhoods (Bennett, Uyttendaele, Zitnick et al. 2006) © 2006 Springer. After two-means clustering and reprojection along the line joining the two dominant colors (red dots), the majority of the pixels fall near the fitted line. The distribution along the line, projected along the RGB axes, is peaked at 0 and 1, the two dominant colors.

To reduce color fringing, some techniques perform a color space analysis, e.g., using median filtering on color opponent channels (Longere, Delahunt, Zhang et al. 2002). The approach of Bennett, Uyttendaele, Zitnick et al. (2006) locally forms a two-color model from an initial demosaicing result, using a moving 5 × 5 window to find the two dominant colors (Figure 10.36). 21

Once the local color model has been estimated at each pixel, a Bayesian approach is 
then used to encourage pixel values to lie along each color line and to cluster around the 
dominant color values, which reduces halos (Figure 10.35d). The Bayesian approach also 
supports the simultaneous application of demosaicing and super-resolution, i.e., multiple CFA 
inputs can be merged into a higher-quality full-color image, which becomes more important 
as additional processing becomes incorporated into today’s cameras. 

10.3.2 Application: Colorization

Although not strictly an example of super-resolution, the process of colorization, i.e., manually adding colors to a “black and white” (grayscale) image, is another example of a sparse interpolation problem. In most applications of colorization, the user draws some scribbles indicating the desired colors in certain regions (Figure 10.37a) and the system interpolates the specified chrominance (u, v) values to the whole image, which are then re-combined with the input luminance channel to produce a final colorized image, as shown in Figure 10.37b. In the system developed by Levin, Lischinski, and Weiss (2004), the interpolation is performed using locally weighted regularization (3.100), where the local smoothness weights are inversely proportional to luminance gradients. This approach to locally weighted regularization has inspired later algorithms for high dynamic range tone mapping (Lischinski, Farbman, Uyttendaele et al. 2006a), see Section 10.2.1, as well as other applications of the weighted least squares (WLS) formulation (Farbman, Fattal, Lischinski et al. 2008). An alternative approach to performing the sparse chrominance interpolation based on geodesic (edge-aware) distance functions has been developed by Yatziv and Sapiro (2006).

21 Previous work on locally linear color models (Klinker, Shafer, and Kanade 1990; Omer and Werman 2004) focuses on color and illumination variation within a single material, whereas Bennett, Uyttendaele, Zitnick et al. (2006) use the two-color model to describe variations across color (material) edges.

Figure 10.37 Colorization using optimization (Levin, Lischinski, and Weiss 2004) © 2004 ACM: (a) grayscale image with some color scribbles overlaid; (b) resulting colorized image; (c) original color image from which the grayscale image and the chrominance values for the scribbles were derived. Original photograph by Rotem Weiss.

10.4 Image matting and compositing 

Image matting and compositing is the process of cutting a foreground object out of one image 
and pasting it against a new background (Smith and Blinn 1996; Wang and Cohen 2007a). 
It is commonly used in television and film production to composite a live actor in front of 
computer-generated imagery such as weather maps or 3D virtual characters and scenery 
(Wright 2006; Brinkmann 2008). 

We have already seen a number of tools for interactively segmenting objects in an image, 
including snakes (Section 5.1.1), scissors (Section 5.1.3), and GrabCut segmentation (Sec- 
tion 5.5). While these techniques can generate reasonable pixel-accurate segmentations, they 
fail to capture the subtle interplay of foreground and background colors at mixed pixels along 
the boundary (Szeliski and Golland 1999) (Figure 10.38a). 

In order to successfully copy a foreground object from one image to another without visible discretization artifacts, we need to pull a matte, i.e., to estimate a soft opacity channel $\alpha$ and the uncontaminated foreground colors F from the input composite image C. Recall from Section 3.1.3 (Figure 3.4) that the compositing equation (3.8) can be written as

$$C = (1 - \alpha) B + \alpha F. \tag{10.30}$$

This operator attenuates the influence of the background image B by a factor $(1 - \alpha)$ and then adds in the (partial) color values corresponding to the foreground element F.

Figure 10.38 Softening a hard segmentation boundary (border matting) (Rother, Kolmogorov, and Blake 2004) © 2004 ACM: (a) the region surrounding a segmentation boundary where pixels of mixed foreground and background colors are visible; (b) pixel values along the boundary are used to compute a soft alpha matte; (c) at each point along the curve t, a displacement $\Delta$ and a width $\sigma$ are estimated.

While the compositing operation is easy to implement, the reverse matting operation of estimating F, $\alpha$, and B given an input image C is much more challenging (Figure 10.39). To see why, observe that while the composite pixel color C provides three measurements, the F, $\alpha$, and B unknowns have a total of seven degrees of freedom. Devising techniques to estimate these unknowns despite the underconstrained nature of the problem is the essence of image matting.
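A few lines of Python make the ambiguity tangible. The sketch below implements the compositing equation per channel and then shows two different foreground/opacity pairs that yield exactly the same composite over the same background. The function name and the specific colors are illustrative assumptions.

```python
def composite(fg, bg, alpha):
    """Per-channel compositing: C = (1 - alpha) * B + alpha * F."""
    return tuple((1 - alpha) * b + alpha * f for f, b in zip(fg, bg))


background = (0.0, 0.0, 1.0)  # pure blue

# A half-transparent red foreground ...
c1 = composite((1.0, 0.0, 0.0), background, alpha=0.5)
# ... and a fully opaque purple foreground give the same observed color.
c2 = composite((0.5, 0.0, 0.5), background, alpha=1.0)
```

Even with the background known, the three measured values in C cannot distinguish these two explanations (four unknowns remain), and with B unknown as well the count rises to the seven degrees of freedom noted above.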

In this section, we review a number of image matting techniques. We begin with blue screen matting, which assumes that the background is a constant known color, and discuss its variants, two-screen matting (when multiple backgrounds can be used) and difference matting (where the known background is arbitrary). We then discuss local variants of natural image matting, where both the foreground and background are unknown. In these applications, it is usual to first specify a trimap, i.e., a three-way labeling of the image into foreground, background, and unknown regions (Figure 10.39b). Next, we present some global optimization approaches to natural image matting. Finally, we discuss variants on the matting problem, including shadow matting, flash matting, and environment matting.

Figure 10.39 Natural image matting (Chuang, Curless, Salesin et al. 2001) © 2001 IEEE: (a) input image with a “natural” (non-constant) background; (b) hand-drawn trimap (gray indicates unknown regions); (c) extracted alpha map; (d) extracted (premultiplied) foreground colors; (e) composite over a new background.


10.4.1 Blue screen matting 

Blue screen matting involves filming an actor (or object) in front of a constant colored background. While originally bright blue was the preferred color, bright green is now more commonly used (Wright 2006; Brinkmann 2008). Smith and Blinn (1996) discuss a number of techniques for blue screen matting, which are mostly described in patents rather than in the open research literature. Early techniques used linear combinations of object color channels with user-tuned parameters to estimate the opacity $\alpha$.

Chuang, Curless, Salesin et al. (2001) describe a newer technique called Mishima’s algorithm, which involves fitting two polyhedral surfaces (centered at the mean background color), separating the foreground and background color distributions and then measuring the relative distance of a novel color to these surfaces to estimate $\alpha$ (Figure 10.41e). While this technique works well in many studio settings, it can still suffer from blue spill, where translucent pixels around the edges of an object acquire some of the background blue coloration (Figure 10.40).

Two-screen matting. In their paper, Smith and Blinn (1996) also introduce an algorithm called triangulation matting that uses more than one known background color to over-constrain the equations required to estimate the opacity $\alpha$ and foreground color F.

Figure 10.40 Blue-screen matting results (Chuang, Curless, Salesin et al. 2001) © 2001 IEEE. Mishima’s method produces visible blue spill (color fringing in the hair), while Chuang’s Bayesian matting approach produces accurate results. Panel columns: alpha matte, composite, and inset.


For example, consider in the compositing equation (10.30) setting the background color to black, i.e., B = 0. The resulting composite image C is therefore equal to $\alpha F$. Replacing the background color with a different known non-zero value B now results in

$$C - \alpha F = (1 - \alpha) B, \tag{10.31}$$

which is an overconstrained set of (color) equations for estimating $\alpha$. In practice, B should be chosen so as not to saturate C and, for best accuracy, several values of B should be used. It is also important that colors be linearized before processing, which is the case for all image matting algorithms. Papers that generate ground truth alpha mattes for evaluation purposes normally use these techniques to obtain accurate matte estimates (Chuang, Curless, Salesin et al. 2001; Wang and Cohen 2007b; Levin, Acha, and Lischinski 2008; Rhemann, Rother, Rav-Acha et al. 2008; Rhemann, Rother, Wang et al. 2009). 22 Exercise 10.8 has you do this as well.
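The black-plus-known-background case can be worked through in a few lines. The sketch below solves the three color equations for the opacity in the least squares sense and then recovers the foreground from the black-background composite. It handles only this two-background case with noise-free, linearized colors; the function names and sample values are assumptions for illustration.

```python
def triangulation_alpha(c_black, c_known, b_known):
    """Least squares opacity from C_known - C_black = (1 - alpha) * B."""
    num = sum((ck - c0) * b for ck, c0, b in zip(c_known, c_black, b_known))
    den = sum(b * b for b in b_known)  # B must be non-zero
    return 1.0 - num / den


def recover_foreground(c_black, alpha):
    """Over black, C = alpha * F, so F = C / alpha (undefined when alpha == 0)."""
    return tuple(c / alpha for c in c_black)


# Synthesize the two composites from a known foreground and opacity.
f_true, a_true, b_known = (0.8, 0.4, 0.2), 0.5, (0.0, 0.0, 1.0)
c_black = tuple(a_true * f for f in f_true)
c_known = tuple(a_true * f + (1 - a_true) * b for f, b in zip(f_true, b_known))

alpha = triangulation_alpha(c_black, c_known, b_known)
foreground = recover_foreground(c_black, alpha)
```

With real, noisy measurements the three channel equations will not agree exactly, which is why a least squares solution (and, as the text suggests, several background colors) is preferable to using a single channel.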


22 See the alpha matting evaluation Web site at http://alphamatting.com/.

Difference matting. A related approach when the background is irregular but known is called difference matting (Wright 2006; Brinkmann 2008). It is most commonly used when the actor or object is filmed against a static background, e.g., for office videoconferencing, person tracking applications (Toyama, Krumm, Brumitt et al. 1999), or to produce silhouettes for volumetric 3D reconstruction techniques (Section 11.6.2) (Szeliski 1993; Seitz and Dyer 1997; Seitz, Curless, Diebel et al. 2006). It can also be used with a panning camera where the background is composited from frames where the foreground has been removed using a garbage matte (Section 10.4.5) (Chuang, Agarwala, Curless et al. 2002). Another recent application is the detection of visual continuity errors in films, i.e., differences in the background when a shot is re-taken at a later time (Pickup and Zisserman 2009).

In the case where the foreground and background motions can both be specified with 
parametric transforms, high-quality mattes can be extracted using a generalization of triangu- 
lation matting (Wexler, Fitzgibbon, and Zisserman 2002). When frames need to be processed 
independently, however, the results are often of poor quality (Figure 10.42). In such cases, 
using a pair of stereo cameras as input can dramatically improve the quality of the results 
(Criminisi, Cross, Blake et al. 2006; Yin, Criminisi, Winn et al. 2007). 


10.4.2 Natural image matting 

The most general version of image matting is when nothing is known about the background 
except, perhaps, for a rough segmentation of the scene into foreground, background, and 
unknown regions, which is known as the trimap (Figure 10.39b). Some recent techniques, 
however, relax this requirement and allow the user to just draw a few strokes or scribbles in 
the image, see Figures 10.45 and 10.46 (Wang and Cohen 2005; Wang, Agrawala, and Cohen 
2007; Levin, Lischinski, and Weiss 2008; Rhemann, Rother, Rav-Acha et al. 2008; Rhemann, 
Rother, and Gelautz 2008). Fully automated single image matting results have also been 
reported (Levin, Acha, and Lischinski 2008; Singaraju, Rother, and Rhemann 2009). The 
survey paper by Wang and Cohen (2007a) has detailed descriptions and comparisons of all of 
these techniques, a selection of which are described briefly below. 

Figure 10.41 Image matting algorithms (Chuang, Curless, Salesin et al. 2001) © 2001 IEEE. Mishima’s algorithm models global foreground and background color distribution as polyhedral surfaces centered around the mean background (blue) color. Knockout uses a local color estimate of foreground and background for each pixel and computes $\alpha$ along each color axis. Ruzon and Tomasi’s algorithm locally models foreground and background colors and variances. Chuang et al.’s Bayesian matting approach computes a MAP estimate of (fractional) foreground color and opacity given the local foreground and background distributions.

A relatively simple algorithm for performing natural image matting is Knockout, as described by Chuang, Curless, Salesin et al. (2001) and illustrated in Figure 10.41f. In this algorithm, the nearest known foreground and background pixels (in image space) are determined and then blended with neighboring known pixels to produce a per-pixel foreground F and background B color estimate. The background color is then adjusted so that the measured color C lies on the line between F and B. Finally, opacity $\alpha$ is estimated on a per-channel basis, and the three estimates are combined based on per-channel color differences. (This is an approximation to the least squares solution for $\alpha$.) Figure 10.42 shows that Knockout has problems when the background consists of more than one dominant local color.
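The relationship between a per-channel opacity estimate and the least squares solution mentioned in the Knockout description can be sketched as follows, given per-pixel estimates of F and B. The weighting by absolute per-channel color difference is an assumption made for illustration and is not claimed to be Knockout's exact rule; the function names are likewise hypothetical.

```python
def alpha_least_squares(c, f, b):
    """Project C onto the line from B to F: alpha = (C-B).(F-B) / |F-B|^2."""
    num = sum((ci - bi) * (fi - bi) for ci, fi, bi in zip(c, f, b))
    den = sum((fi - bi) ** 2 for fi, bi in zip(f, b))  # requires F != B
    return num / den


def alpha_per_channel(c, f, b):
    """Weighted mean of per-channel estimates (C_c - B_c) / (F_c - B_c).

    Weights |F_c - B_c| are an assumed choice; channels with F_c == B_c
    carry no information about alpha and are skipped.
    """
    num = den = 0.0
    for ci, fi, bi in zip(c, f, b):
        d = fi - bi
        if d != 0.0:
            num += abs(d) * (ci - bi) / d
            den += abs(d)
    return num / den


f, b = (1.0, 0.5, 0.0), (0.0, 0.5, 1.0)
c = (0.3, 0.5, 0.7)  # lies exactly on the line between B and F
a_ls = alpha_least_squares(c, f, b)
a_pc = alpha_per_channel(c, f, b)
```

When C lies exactly on the line between B and F, both estimates agree. They differ only when noise or a poor F/B estimate pushes C off that line, and if the per-channel weights were the squared differences instead, the weighted mean would reduce to the least squares formula exactly.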


More accurate matting results can be obtained if we treat the foreground and background colors as distributions sampled over some region (Figure 10.41g-h). Ruzon and Tomasi (2000) model local color distributions as mixtures of (uncorrelated) Gaussians and compute these models in strips. They then find the pairing of mixture components F and B that best describes the observed color C, compute $\alpha$ as the relative distance between these means, and adjust the estimates of F and B so they are collinear with C.

Chuang, Curless, Salesin et al. (2001) and Hillman, Hannah, and Renshaw (2001) use full 3 × 3 color covariance matrices to model mixtures of correlated Gaussians, and compute estimates independently for each pixel. Matte extraction proceeds in strips starting from known color values growing into the unknown regions, so that recently computed F and B colors can be used in later stages.

To estimate the most likely value of an unknown pixel’s opacity and (unmixed) foreground 
and background colors, Chuang et al. use a fully Bayesian formulation that maximizes 


$$P(F, B, \alpha \mid C) = P(C \mid F, B, \alpha)\, P(F)\, P(B)\, P(\alpha) / P(C). \tag{10.32}$$

Figure 10.42 Natural image matting results (Chuang, Curless, Salesin et al. 2001) © 2001 
IEEE. Difference matting and Knockout both perform poorly on this kind of background, 
while the more recent natural image matting techniques perform well. Chuang et al.’s results 
are slightly smoother and closer to the ground truth. 

This is equivalent to minimizing the negative log likelihood 


$$L(F, B, \alpha \mid C) = L(C \mid F, B, \alpha) + L(F) + L(B) + L(\alpha) \tag{10.33}$$


(dropping the L(C) term since it is constant). 

Let us examine each of these terms in turn. The first, L(C | F, B, α), is the likelihood that 
pixel color C was observed given values for the unknowns (F, B, α). If we assume Gaussian 
noise in our observation with variance $\sigma_C^2$, this negative log likelihood (data term) is 

$$L(C) = \frac{1}{2} \| C - [\alpha F + (1 - \alpha) B] \|^2 / \sigma_C^2, \tag{10.34}$$

as illustrated in Figure 10.41h. 

The second term, L(F), corresponds to the likelihood that a particular foreground color F 
comes from the mixture of Gaussians distribution. After partitioning the sample foreground 
colors into clusters, a weighted mean $\bar{F}$ and covariance $\Sigma_F$ are computed for each cluster, 
where the weights depend on a given foreground pixel’s opacity and its distance from the 
unknown pixel (more opaque and nearer samples receive larger weights). The 
negative log likelihood for each cluster is thus given by 

$$L(F) = (F - \bar{F})^T \Sigma_F^{-1} (F - \bar{F}). \tag{10.35}$$

A similar method is used to estimate unknown background color distributions. If the background 
is already known, i.e., for blue screen or difference matting applications, its measured 
color value and variance are used instead. 

An alternative to modeling the foreground and background color distributions as mixtures 
of Gaussians is to keep around the original color samples and to compute the most likely 
pairings that explain the observed color C (Wang and Cohen 2005, 2007b). These techniques 
are described in more detail in (Wang and Cohen 2007a). 

In their Bayesian matting paper, Chuang, Curless, Salesin et al. (2001) assume a constant 
(non-informative) distribution for L(α). More recent papers assume this distribution to be 
more peaked around 0 and 1, or sometimes use Markov random fields (MRFs) to define a 
global correlated prior on P(α) (Wang and Cohen 2007a). 

To compute the most likely estimates for (F, B, α), the Bayesian matting algorithm alternates 
between computing (F, B) and α, since each of these problems is quadratic and hence 
can be solved as a small linear system. When several color clusters are estimated, the most 
likely pairing of foreground and background color clusters is used. 

Bayesian image matting produces results that improve on the original natural image matting 
algorithm by Ruzon and Tomasi (2000), as can be seen in Figure 10.42. However, compared 
to more recent techniques (Wang and Cohen 2007a), its performance is not as good for 
complex backgrounds or inaccurate trimaps (Figure 10.44). 
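
The alternation between (F, B) and α can be sketched for a single pixel as follows. This is an illustrative sketch under simplifying assumptions, not Chuang et al.'s implementation: it uses one foreground and one background cluster, a flat prior on α, and the constant factors of (10.34) and (10.35) exactly as written above (which is where the factor of 2 on the prior terms comes from). The function name `bayesian_matte_pixel` and the default `sigma_C` are hypothetical.

```python
import numpy as np

def bayesian_matte_pixel(C, F_mean, F_cov, B_mean, B_cov, sigma_C=0.01, n_iters=20):
    """Alternate between solving for (F, B) and for alpha at one unknown pixel."""
    C, F_mean, B_mean = (np.asarray(v, dtype=float) for v in (C, F_mean, B_mean))
    I = np.eye(3)
    inv_F, inv_B = np.linalg.inv(F_cov), np.linalg.inv(B_cov)
    s2 = sigma_C ** 2
    F, B, alpha = F_mean.copy(), B_mean.copy(), 0.5
    for _ in range(n_iters):
        # With alpha fixed, the cost is quadratic in (F, B): a 6x6 linear system.
        cross = I * alpha * (1.0 - alpha) / s2
        A = np.block([[2 * inv_F + I * alpha ** 2 / s2, cross],
                      [cross, 2 * inv_B + I * (1.0 - alpha) ** 2 / s2]])
        b = np.concatenate([2 * inv_F @ F_mean + C * alpha / s2,
                            2 * inv_B @ B_mean + C * (1.0 - alpha) / s2])
        FB = np.linalg.solve(A, b)
        F, B = FB[:3], FB[3:]
        # With (F, B) fixed, the cost is quadratic in alpha: project C onto the F-B line.
        d = F - B
        alpha = float(np.clip((C - B) @ d / max(float(d @ d), 1e-12), 0.0, 1.0))
    return F, B, alpha

# Synthetic pixel: an exact 30%/70% blend of the two cluster means.
F_mean = np.array([0.9, 0.8, 0.7])
B_mean = np.array([0.1, 0.2, 0.6])
cov = (0.01 ** 2) * np.eye(3)
C = 0.3 * F_mean + 0.7 * B_mean
F, B, alpha = bayesian_matte_pixel(C, F_mean, cov, B_mean, cov)
```

With several clusters per region, this routine would be run for each foreground and background pairing and the pairing with the lowest total cost retained.
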
10.4.3 Optimization-based matting 


An alternative to estimating each pixel’s opacity and foreground color independently is to use 
global optimization to compute a matte that takes into account correlations between neighboring 
α values. Two examples of this are border matting in the GrabCut interactive segmentation 
system (Rother, Kolmogorov, and Blake 2004) and Poisson matting (Sun, Jia, Tang et 
al. 2004). 

Border matting first dilates the region around the binary segmentation produced by GrabCut 
(Section 5.5) and then solves for a sub-pixel boundary location Δ and a blur width σ for 
every point along the boundary (Figure 10.38). Smoothness in these parameters along the 
boundary is enforced using regularization and the optimization is performed using dynamic 
programming. While this technique can obtain good results for smooth boundaries, such as a 
person’s face, it has difficulty with fine details, such as hair. 

Poisson matting (Sun, Jia, Tang et al. 2004) assumes a known foreground and background 
color for each pixel in the trimap (as with Bayesian matting). However, instead of independently 
estimating each α value, it assumes that the gradient of the alpha matte and the gradient 
of the color image are related by 


$$\nabla \alpha = \frac{F - B}{\| F - B \|^2} \cdot \nabla C, \tag{10.36}$$


which can be derived by taking gradients of both sides of (10.30) and assuming that the 
foreground and background vary slowly. The per-pixel gradient estimates are then integrated 
into a continuous α(x) field using the regularization (least squares) technique first described 
in Section 3.7.1 (3.100) and subsequently used in Poisson blending (Section 9.3.4, 9.44) and 
gradient-based dynamic range compression mapping (Section 10.2.1, 10.19). This technique 
works well when good foreground and background color estimates are available and these 
colors vary slowly. 
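
To spell out the derivation of (10.36), here is a short sketch. It assumes that the compositing equation (10.30) has the form C = αF + (1 − α)B, and it reads ∇C as the 3 × 2 matrix of per-channel image gradients (an interpretation adopted only for this sketch), so that the dot product in (10.36) sums over the color channels.

```latex
\begin{align*}
\nabla C &= (F - B)\,(\nabla\alpha)^T + \alpha\,\nabla F + (1 - \alpha)\,\nabla B
          \;\approx\; (F - B)\,(\nabla\alpha)^T
          && \text{(slowly varying $F$ and $B$)} \\
(F - B)^T\,\nabla C &\approx \|F - B\|^2\,(\nabla\alpha)^T
          && \text{(multiply on the left by $(F - B)^T$)} \\
\nabla\alpha &\approx \frac{F - B}{\|F - B\|^2} \cdot \nabla C .
\end{align*}
```

The approximation degrades wherever ∇F or ∇B is not negligible compared with (F − B)∇α, which is consistent with the remark that the technique works well when the foreground and background colors vary slowly.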

Instead of computing per-pixel foreground and background colors, Levin, Lischinski, and 
Weiss (2008) assume only that these color distributions can locally be well approximated as 
mixtures of two colors, which is known as the color line model (Figure 10.43a-c). Under this 
assumption, a closed-form estimate for α at each pixel i in a (say, 3 × 3) window $W_k$ is given 
by 

$$\alpha_i = a_k \cdot (C_i - B_0) = a_k \cdot C_i + b_k, \tag{10.37}$$

where $C_i$ is the pixel color treated as a three-vector, $B_0$ is any pixel along the background 
color line, and $a_k$ is the vector joining the two closest points on the foreground and background 
color lines, as shown in Figure 10.43c. (Note that the geometric derivation shown 
in this figure is an alternative to the algebraic derivation presented by Levin, Lischinski, and 
Weiss (2008).) Minimizing the deviations of the alpha values $\alpha_i$ from their respective color 
line models (10.37) over all overlapping windows $W_k$ in the image gives rise to the cost 

$$E_\alpha = \sum_k \left( \sum_{i \in W_k} (\alpha_i - a_k \cdot C_i - b_k)^2 + \epsilon \| a_k \|^2 \right), \tag{10.38}$$

where the ε term is used to regularize the value of $a_k$ in the case where the two color distributions 
overlap (i.e., in constant α regions). 

Figure 10.43 Color line matting (Levin, Lischinski, and Weiss 2008): (a) local 3 × 3 patch 
of colors; (b) potential assignment of α values; (c) foreground and background color lines, 
the vector $a_k$ joining their closest points of intersection, and the family of parallel planes of 
constant α values, $\alpha_i = a_k \cdot (C_i - B_0)$; (d) a scatter plot of sample colors and the deviations 
from the mean $\mu_k$ for two sample colors $C_i$ and $C_j$. 

Because this formula is quadratic in the unknowns $\{(a_k, b_k)\}$, they can be eliminated 
inside each window $W_k$, leading to a final energy 

$$E_\alpha = \alpha^T L \alpha, \tag{10.39}$$

where the entries in the L matrix are given by 

$$L_{ij} = \sum_{k:\, i \in W_k \wedge j \in W_k} \left( \delta_{ij} - \frac{1}{M} \left( 1 + (C_i - \mu_k)^T \hat{\Sigma}_k^{-1} (C_j - \mu_k) \right) \right), \tag{10.40}$$

where $\delta_{ij}$ is the Kronecker delta, $M = |W_k|$ is the number of pixels in each (overlapping) 
window, $\mu_k$ is the mean color of the pixels in window $W_k$, and $\hat{\Sigma}_k$ is the 3 × 3 covariance 
of the pixel colors plus $(\epsilon / M) I$. 

Figure 10.43d shows the intuition behind the entries in this affinity matrix, which is called 
the matting Laplacian. Note how when two pixels $C_i$ and $C_j$ in $W_k$ point in opposite directions 
away from the mean $\mu_k$, their weighted dot product is close to −1, and so their affinity 
becomes close to 0. Pixels close to each other in color space (and hence with similar expected 
α values) will have affinities close to −2/M. 

Figure 10.44 Comparative matting results for a medium accuracy trimap. Wang and Cohen 
(2007a) describe the individual techniques being compared. 

Minimizing the quadratic energy (10.39) constrained by the known values of α ∈ {0, 1} 
at scribbles only requires the solution of a sparse set of linear equations, which is why the 
authors call their technique a closed-form solution to natural image matting. Once α has 
been computed, the foreground and background colors are estimated using a least squares 
minimization of the compositing equation (10.30) regularized with a spatially varying first-order 
smoothness, 

$$E_{B,F} = \sum_i \| C_i - [\alpha_i F_i + (1 - \alpha_i) B_i] \|^2 + \lambda |\nabla \alpha_i| \left( \| \nabla F_i \|^2 + \| \nabla B_i \|^2 \right), \tag{10.41}$$

where the $|\nabla \alpha_i|$ weight is applied separately for the x and y components of the F and B 
derivatives (Levin, Lischinski, and Weiss 2008). 
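
The construction of the matting Laplacian (10.40) and the constrained minimization of (10.39) can be sketched as below. This is a small dense illustration for tiny images, not the authors' code: a practical implementation would use sparse matrices. The function names, the default `eps`, and the use of a soft penalty with weight `lam` to impose the scribble constraints are all assumptions made for this sketch.

```python
import numpy as np

def matting_laplacian(img, eps=1e-7, r=1):
    """Dense matting Laplacian (10.40) for a small H x W x 3 image."""
    H, W, _ = img.shape
    M = (2 * r + 1) ** 2                         # pixels per window
    idx = np.arange(H * W).reshape(H, W)
    L = np.zeros((H * W, H * W))
    for y in range(r, H - r):
        for x in range(r, W - r):
            win = idx[y - r:y + r + 1, x - r:x + r + 1].ravel()
            colors = img[y - r:y + r + 1, x - r:x + r + 1].reshape(M, 3)
            d = colors - colors.mean(axis=0)     # deviations from the window mean
            cov = d.T @ d / M + (eps / M) * np.eye(3)
            # Window term: delta_ij - (1 + (Ci - mu)^T cov^-1 (Cj - mu)) / M
            G = np.eye(M) - (1.0 + d @ np.linalg.solve(cov, d.T)) / M
            L[np.ix_(win, win)] += G
    return L

def solve_alpha(L, known_mask, known_alpha, lam=100.0):
    """Minimize alpha^T L alpha plus lam * sum over scribbles of (alpha - known)^2."""
    D = np.diag(known_mask.ravel().astype(float))
    rhs = lam * (known_mask * known_alpha).ravel()
    return np.linalg.solve(L + lam * D, rhs)

# Two-color test image with one background and one foreground scribble pixel.
img = np.zeros((4, 6, 3))
img[:, :3] = [0.2, 0.3, 0.8]                     # "background" color
img[:, 3:] = [0.9, 0.6, 0.1]                     # "foreground" color
L = matting_laplacian(img)
known = np.zeros((4, 6), dtype=bool)
known[0, 0] = known[0, 5] = True
values = np.zeros((4, 6))
values[0, 5] = 1.0
alpha = solve_alpha(L, known, values).reshape(4, 6)
```

In this toy example the two scribbled pixels are enough to propagate α ≈ 0 over the whole left region and α ≈ 1 over the right, because every window satisfies the color line model almost exactly.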

Laplacian (closed-form) matting is just one of many optimization-based techniques surveyed 
and compared by Wang and Cohen (2007a). Some of these techniques use alternative 
formulations for the affinities or smoothness terms on the α matte, alternative estimation 
techniques such as belief propagation, or alternative representations (e.g., local histograms) 
for modeling local foreground and background color distributions (Wang and Cohen 2005, 
2007b,c). Some of these techniques also provide real-time results as the user draws a contour 
line or sparse set of scribbles (Wang, Agrawala, and Cohen 2007; Rhemann, Rother, Rav-Acha 
et al. 2008) or even pre-segment the image into a small number of mattes that the user 
can select with simple clicks (Levin, Acha, and Lischinski 2008). 

Figure 10.44 shows the results of running a number of the surveyed algorithms on a region 
of toy animal fur where a trimap has been specified, while Figure 10.45 shows results for 
techniques that can produce mattes with only a few scribbles as input. Figure 10.46 shows 
a result for an even more recent algorithm (Rhemann, Rother, Rav-Acha et al. 2008) that 
claims to outperform all of the techniques surveyed by Wang and Cohen (2007a). 

Figure 10.45 Comparative matting results with scribble-based inputs. Wang and Cohen 
(2007a) describe the individual techniques being compared. 

Figure 10.46 Stroke-based segmentation result (Rhemann, Rother, Rav-Acha et al. 2008) 
© 2008 IEEE. 

Pasting. Once a matte has been pulled from an image, it is usually composited directly 
over the new background, unless the seams between the cutout and background regions are 
to be hidden, in which case Poisson blending (Perez, Gangnet, and Blake 2003) can be used 
(Section 9.3.4). 

In the latter case, it is helpful if the matte boundary passes through regions that either 
have little texture or look similar in the old and new images. Papers by Jia, Sun, Tang et al. 
(2006) and Wang and Cohen (2007c) explain how to do this. 


10.4.4 Smoke, shadow, and flash matting 

In addition to matting out solid objects with fractional boundaries, it is also possible to matte 
out translucent media such as smoke (Chuang, Agarwala, Curless et al. 2002). Starting with 
a video sequence, each pixel is modeled as a linear combination of its (unknown) background 
color and a constant foreground (smoke) color that is common to all pixels. Voting in color 
space is used to estimate this foreground color and the distance along each color line is used 
to estimate the per-pixel temporally varying alpha (Figure 10.47). 

Figure 10.47 Smoke matting (Chuang, Agarwala, Curless et al. 2002) © 2002 ACM: (a) 
input video frame; (b) after removing the foreground object; (c) estimated alpha matte; (d) 
insertion of new objects into the background. 

Figure 10.48 Shadow matting (Chuang, Goldman, Curless et al. 2003) © 2003 ACM. Instead 
of simply darkening the new scene with the shadow (c), shadow matting correctly dims 
the lit scene with the new shadow and drapes the shadow over 3D geometry (d). 

Extracting and re-inserting shadows is also possible using a related technique (Chuang, 
Goldman, Curless et al. 2003). Here, instead of assuming a constant foreground color, each 
pixel is assumed to vary between its fully lit and fully shadowed colors, which can be esti- 
mated by taking (robust) minimum and maximum values over time as a shadow passes over 
the scene (Exercise 10.9). The resulting fractional shadow matte can be used to re-project 
the shadow into a new scene. If the destination scene has a non-planar geometry, it can be 
scanned by waving a straight stick shadow across the scene. The new shadow matte can then 
be warped with the computed deformation field to have it drape correctly over the new scene 
(Figure 10.48). 
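
The estimation of lit and shadowed values and of the fractional shadow matte can be sketched as follows. This is a simplified illustration rather than the method of Chuang, Goldman, Curless et al.: it works on grayscale frames, uses percentiles as one assumed form of "robust" minimum and maximum, and adopts the convention (chosen here, not taken from the paper) that a matte value of 1 means fully lit and 0 means fully shadowed. The function name `shadow_matte` is hypothetical.

```python
import numpy as np

def shadow_matte(frames, lo_pct=2.0, hi_pct=98.0):
    """Estimate per-pixel lit/shadowed values and a per-frame shadow matte.

    frames: T x H x W grayscale stack of a static scene crossed by a moving shadow.
    """
    frames = np.asarray(frames, dtype=float)
    shadowed = np.percentile(frames, lo_pct, axis=0)   # robust per-pixel minimum
    lit = np.percentile(frames, hi_pct, axis=0)        # robust per-pixel maximum
    span = np.maximum(lit - shadowed, 1e-6)            # guard against division by zero
    beta = np.clip((frames - shadowed) / span, 0.0, 1.0)
    return lit, shadowed, beta

# Synthetic sequence: lit value 0.8, shadowed value 0.2, one penumbra frame at 0.5.
frames = np.full((20, 2, 3), 0.8)
frames[5:10] = 0.2
frames[10] = 0.5
lit, shadowed, beta = shadow_matte(frames)
```

Pixels that the shadow never reaches have nearly equal lit and shadowed estimates, so their matte values are unreliable and would need to be masked out in practice.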

The quality and reliability of matting algorithms can also be enhanced using more sophis- 
ticated acquisition systems. For example, taking a flash and non-flash image pair supports 
the reliable extraction of foreground mattes, which show up as regions of large illumination 
change between the two images (Sun, Li, Kang et al. 2006). Taking simultaneous video 
streams focused at different distances (McGuire, Matusik, Pfister et al. 2005) or using multi- 
camera arrays (Joshi, Matusik, and Avidan 2006) are also good approaches to producing 
high-quality mattes. These techniques are described in more detail in (Wang and Cohen 
2007a). 

Lastly, photographing a refractive object in front of a number of patterned backgrounds 
allows the object to be placed in novel 3D environments. These environment matting techniques 
(Zongker, Werner, Curless et al. 1999; Chuang, Zongker, Hindorff et al. 2000) are 
discussed in Section 13.4. 

10.4.5 Video matting 

While regular single-frame matting techniques such as blue or green screen matting (Smith 
and Blinn 1996; Wright 2006; Brinkmann 2008) can be applied to video sequences, the pres- 
ence of moving objects can sometimes make the matting process easier, as portions of the 
background may get revealed in preceding or subsequent frames. 

Chuang, Agarwala, Curless et al. (2002) describe a nice approach to this video matting 
problem, where foreground objects are first removed using a conservative garbage matte and 
the resulting background plates are aligned and composited to yield a high-quality background 
estimate. They also describe how trimaps drawn at sparse keyframes can be interpolated 
to in-between frames using bi-directional optical flow. Alternative approaches to video 
matting, such as rotoscoping, which involves drawing and tracking curves in video sequences 
(Agarwala, Hertzmann, Seitz et al. 2004), are discussed in the matting survey paper by Wang 
and Cohen (2007a). 


10.5 Texture analysis and synthesis 

While texture analysis and synthesis may not at first seem like computational photography 
techniques, they are, in fact, widely used to repair defects, such as small holes, in images or 
to create non-photorealistic painterly renderings from regular photographs. 

The problem of texture synthesis can be formulated as follows: given a small sample of 
a “texture” (Figure 10.49a), generate a larger similar-looking image (Figure 10.49b). As you 
can imagine, for certain sample textures, this problem can be quite challenging. 

Figure 10.49 Texture synthesis: (a) given a small patch of texture, the task is to synthesize 
(b) a similar-looking larger patch; (c) other semi-structured textures that are challenging to 
synthesize. (Images courtesy of Alyosha Efros.) 

Traditional approaches to texture analysis and synthesis try to match the spectrum of the 
source image while generating shaped noise. Matching the frequency characteristics, which 
is equivalent to matching spatial correlations, is in itself not sufficient. The distributions of 
the responses at different frequencies must also match. Heeger and Bergen (1995) develop an 
algorithm that alternates between matching the histograms of multi-scale (steerable pyramid) 
responses and matching the final image histogram. Portilla and Simoncelli (2000) improve 
on this technique by also matching pairwise statistics across scale and orientations. De Bonet 
(1997) uses a coarse-to-fine strategy to find locations in the source texture with a similar 
parent structure, i.e., similar multi-scale oriented filter responses, and then randomly chooses 
one of these matching locations as the current sample value. 
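
The histogram matching primitive that Heeger and Bergen's algorithm applies repeatedly can be sketched as below. Only the matching step is shown; the steerable pyramid decomposition and the alternation between subbands and the final image are omitted, and the rank-based formulation and the name `match_histogram` are choices made for this sketch rather than details taken from the paper.

```python
import numpy as np

def match_histogram(source, target):
    """Monotonically remap `source` values so their histogram matches that of `target`."""
    src = np.asarray(source, dtype=float)
    tgt_sorted = np.sort(np.asarray(target, dtype=float).ravel())
    order = np.argsort(src.ravel(), kind="stable")
    # Rank of each source value, rescaled to index into the sorted target values.
    ranks = np.empty(src.size, dtype=float)
    ranks[order] = np.arange(src.size)
    pos = ranks * (tgt_sorted.size - 1) / max(src.size - 1, 1)
    matched = np.interp(pos, np.arange(tgt_sorted.size), tgt_sorted)
    return matched.reshape(src.shape)

# Shape Gaussian noise so that it takes on the value distribution of a "texture".
rng = np.random.default_rng(0)
noise = rng.normal(size=(8, 8))
texture = rng.uniform(size=(8, 8))
shaped = match_histogram(noise, texture)
```

Because the remapping is monotonic, the spatial ordering of the noise values is preserved while their distribution becomes that of the target, which is the property the alternating scheme relies on.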

More recent texture synthesis algorithms sequentially generate texture pixels by looking 
for neighborhoods in the source texture that are similar to the currently synthesized image 
(Efros and Leung 1999). Consider the (as yet) unknown pixel p in the partially constructed 
texture on the left side of Figure 10.50. Since some of its neighboring pixels have 
already been synthesized, we can look for similar partial neighborhoods in the sample texture 
image on the right and randomly select one of these as the new value of p. This process 
can be repeated down the new image either in a raster fashion or by scanning around the 
periphery (“onion peeling”) when filling holes, as discussed in Section 10.5.1. In their 
actual implementation, Efros and Leung (1999) find the most similar neighborhood and then 
include all other neighborhoods whose distance is within a factor of (1 + ε) of this best 
match distance d, with ε = 0.1. They also 
optionally weight the random pixel selections by the similarity metric d. 
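
The selection of a single new pixel value can be sketched as follows. This is a simplified illustration of non-parametric sampling, not Efros and Leung's code: it uses an unweighted sum of squared differences over the known neighbors (their implementation also applies a Gaussian weighting), a grayscale sample, and an exhaustive search. The function name `synthesize_pixel` is hypothetical.

```python
import numpy as np

def synthesize_pixel(sample, window, known, eps=0.1, rng=None):
    """Pick a value for the center of `window` from similar neighborhoods in `sample`.

    sample: 2D grayscale texture; window: (2r+1) x (2r+1) partially filled patch;
    known: boolean mask of the already synthesized pixels in `window`.
    """
    rng = np.random.default_rng() if rng is None else rng
    k = window.shape[0]
    r = k // 2
    H, W = sample.shape
    dists, centers = [], []
    for y in range(H - k + 1):
        for x in range(W - k + 1):
            patch = sample[y:y + k, x:x + k]
            # Sum of squared differences over the known neighbors only.
            dists.append(np.sum(((patch - window) ** 2)[known]))
            centers.append(patch[r, r])
    dists = np.asarray(dists)
    # Keep every neighborhood within (1 + eps) of the best match, then sample one.
    candidates = np.flatnonzero(dists <= (1.0 + eps) * dists.min())
    return centers[int(rng.choice(candidates))]

# Periodic test texture; the neighborhood is taken from the sample in raster order.
yy, xx = np.mgrid[0:10, 0:10]
sample = ((xx + 2 * yy) % 5).astype(float)
window = sample[3:6, 4:7].copy()
known = np.array([[True, True, True],
                  [True, False, False],
                  [False, False, False]])
window[~known] = np.nan                      # pixels not yet synthesized
value = synthesize_pixel(sample, window, known, rng=np.random.default_rng(0))
```

For this perfectly periodic sample every exact match has the same center value, so the result is deterministic; on real textures the random choice among near-best matches is what prevents verbatim copying.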

To accelerate this process and improve its visual quality, Wei and Levoy (2000) extend 
this technique using a coarse-to-fine generation process, where coarser levels of the pyramid, 
which have already been synthesized, are also considered during the matching (De Bonet 
1997). To accelerate the nearest neighbor finding, tree-structured vector quantization is used. 

Figure 10.50 Texture synthesis using non-parametric sampling (Efros and Leung 1999). 
The value of the newest pixel p is randomly chosen from similar local (partial) patches in the 
source texture (input image). (Figure courtesy of Alyosha Efros.) 

Figure 10.51 Texture synthesis by image quilting (Efros and Freeman 2001). Instead of 
generating a single pixel at a time, larger blocks are copied from the source texture. The transitions 
in the overlap regions between the selected blocks are then optimized using dynamic 
programming. (Figure courtesy of Alyosha Efros.) 

Efros and Freeman (2001) propose an alternative acceleration and visual quality improvement 
technique. Instead of synthesizing a single pixel at a time, overlapping square blocks 
are selected using similarity with previously synthesized regions (Figure 10.51). Once the 
appropriate blocks have been selected, the seam between newly overlapping blocks is determined 
using dynamic programming. (Full graph cut seam selection is not required, since only 
one seam location per row is needed for a vertical boundary.) Because this process involves 
selecting small patches and then stitching them together, Efros and Freeman (2001) call their 
system image quilting. Komodakis and Tziritas (2007b) present an MRF-based version of 
this block synthesis algorithm that uses a new, efficient version of loopy belief propagation 
they call “Priority-BP”. 
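
The dynamic programming step for a vertical overlap can be sketched as below. It is a minimal illustration of a minimum error boundary cut, in which the seam may move by at most one column from one row to the next; the function name `min_error_boundary_cut` and the squared-difference error are assumptions made for this sketch.

```python
import numpy as np

def min_error_boundary_cut(overlap_a, overlap_b):
    """Return, for each row, the column through which the seam between two blocks passes."""
    err = (np.asarray(overlap_a, float) - np.asarray(overlap_b, float)) ** 2
    if err.ndim == 3:                          # sum the error over color channels
        err = err.sum(axis=2)
    rows, cols = err.shape
    cost = err.copy()
    for y in range(1, rows):
        for x in range(cols):
            lo, hi = max(x - 1, 0), min(x + 2, cols)
            cost[y, x] += cost[y - 1, lo:hi].min()
    # Backtrack from the cheapest cell in the last row.
    seam = np.empty(rows, dtype=int)
    seam[-1] = int(np.argmin(cost[-1]))
    for y in range(rows - 2, -1, -1):
        x = seam[y + 1]
        lo, hi = max(x - 1, 0), min(x + 2, cols)
        seam[y] = lo + int(np.argmin(cost[y, lo:hi]))
    return seam

# The two overlap strips agree only along one connected path of pixels.
a = np.zeros((4, 5))
b = np.ones((4, 5))
for row, col in enumerate([1, 2, 2, 3]):
    b[row, col] = 0.0
seam = min_error_boundary_cut(a, b)
```

Pixels to the left of the seam would then be taken from the existing block and pixels to its right from the new block; a horizontal overlap is handled by transposing the inputs.
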

Figure 10.52 Image inpainting (hole filling): (a-b) propagation along isophote directions 
(Bertalmio, Sapiro, Caselles et al. 2000) © 2000 ACM; (c-d) exemplar-based inpainting 
with confidence-based filling order (Criminisi, Perez, and Toyama 2004). 


10.5.1 Application: Hole filling and inpainting 

Filling holes left behind when objects or defects are excised from photographs, which is 
known as inpainting, is one of the most common applications of texture synthesis. Such 
techniques are used not only to remove unwanted people or interlopers from photographs 
(King 1997) but also to fix small defects in old photos and movies (scratch removal) or to 
remove wires holding props or actors in mid-air during filming (wire removal). Bertalmio, 
Sapiro, Caselles et al. (2000) solve the problem by propagating pixel values along isophote 
(constant-value) directions interleaved with some anisotropic diffusion steps (Figure 10.52a- 
b). Telea (2004) develops a faster technique that uses the fast marching method from level 
sets (Section 5.1.4). However, these techniques will not hallucinate texture in the missing 
regions. Bertalmio, Vese, Sapiro et al. (2003) augment their earlier technique by adding 
synthetic texture to the infilled regions. 

The example-based (non-parametric) texture generation techniques discussed in the previous 
section can also be used by filling the holes from the outside in (the “onion-peel” ordering). 
However, this approach may fail to propagate strong oriented structures. Criminisi, 
Perez, and Toyama (2004) use exemplar-based texture synthesis where the order of synthesis 
is determined by the strength of the gradient along the region boundary (Figures 10.1d and 
10.52c-d). Sun, Yuan, Jia et al. (2004) present a related approach where the user draws interactive 
lines to indicate where structures should be preferentially propagated. Additional 
techniques related to these approaches include those developed by Drori, Cohen-Or, and 
Yeshurun (2003), Kwatra, Schodl, Essa et al. (2003), Kwatra, Essa, Bobick et al. (2005), 
Wilczkowiak, Brostow, Tordoff et al. (2005), Komodakis and Tziritas (2007b), and Wexler, 
Shechtman, and Irani (2007). 
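
The plain onion-peel ordering (as opposed to the gradient-driven ordering of Criminisi, Perez, and Toyama) can be sketched by repeatedly peeling off the ring of hole pixels that touch already known or already filled pixels. The function name `onion_peel_order` and the use of 4-connectivity are assumptions made for this sketch.

```python
import numpy as np

def onion_peel_order(hole):
    """Return the fill layer of each hole pixel (-1 for pixels outside the hole).

    Layer 0 is the ring of hole pixels touching known pixels; each later layer
    touches the previous one, so filling proceeds from the outside in.
    """
    hole = np.asarray(hole, dtype=bool)
    layer = np.full(hole.shape, -1, dtype=int)
    remaining = hole.copy()
    k = 0
    while remaining.any():
        filled = np.pad(~remaining, 1, constant_values=False)
        # A remaining pixel is peeled if any 4-neighbor is already known or filled.
        near = (filled[:-2, 1:-1] | filled[2:, 1:-1] |
                filled[1:-1, :-2] | filled[1:-1, 2:])
        ring = remaining & near
        if not ring.any():                     # hole covers the whole image
            break
        layer[ring] = k
        remaining &= ~ring
        k += 1
    return layer

# A 5 x 5 hole centered in a 7 x 7 image produces three concentric layers.
hole = np.zeros((7, 7), dtype=bool)
hole[1:6, 1:6] = True
layers = onion_peel_order(hole)
```

Pixels would then be synthesized in increasing layer order, for example with a neighborhood-matching texture synthesis step applied to each pixel in turn.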

Most hole filling algorithms borrow small pieces of the original image to fill in the holes. 
When a large database of source images is available, e.g., when images are taken from a 
photo sharing site or the Internet, it is sometimes possible to copy a single contiguous image 
region to fill the hole. Hays and Efros (2007) present such a technique, which uses image 
context and boundary compatibility to select the source image, which is then blended with 
the original (holey) image using graph cuts and Poisson blending. This technique is discussed 
in more detail in Section 14.4.4 and Figure 14.46. 

Figure 10.53 Texture transfer (Efros and Freeman 2001) © 2001 ACM: (a) reference (target) 
image; (b) source texture; (c) image (partially) rendered using the texture. 

10.5.2 Application: Non-photorealistic rendering 

Two more applications of the exemplar-based texture synthesis ideas are texture transfer 
(Efros and Freeman 2001) and image analogies (Hertzmann, Jacobs, Oliver et al. 2001), 
which are both examples of non-photorealistic rendering (Gooch and Gooch 2001). 

In addition to using a source texture image, texture transfer also takes a reference (or 
target) image, and tries to match certain characteristics of the target image with the newly 
synthesized image. For example, the new image being rendered in Figure 10.53c not only 
tries to satisfy the usual similarity constraints with the source texture in Figure 10.53b, but it 
also tries to match the luminance characteristics of the reference image. Efros and Freeman 
(2001) mention that blurred image intensities or local image orientation angles are alternative 
quantities that could be matched. 

Hertzmann, Jacobs, Oliver et al. (2001) formulate the following problem: 

Given a pair of images A and A' (the unfiltered and filtered source images, 
respectively), along with some additional unfiltered target image B, synthesize a 
new filtered target image B' such that 

A : A' :: B : B'. 

Figure 10.54 Image analogies (Hertzmann, Jacobs, Oliver et al. 2001) © 2001 ACM. Given 
an example pair of a source image A and its rendered (filtered) version A', generate the 
rendered version B' from another unfiltered source image B. 


Instead of having the user program a certain non-photorealistic rendering effect, it is sufficient 
to supply the system with examples of before and after images, and let the system synthesize 
the novel image using exemplar-based synthesis, as shown in Figure 10.54. 

The algorithm used to solve image analogies proceeds in a manner analogous to the tex- 
ture synthesis algorithms of (Efros and Leung 1999; Wei and Levoy 2000). Once Gaus- 
sian pyramids have been computed for all of the source and reference images, the algorithm 
looks for neighborhoods in the source filtered pyramids generated from A' that are simi- 
lar to the partially constructed neighborhood in B', while at the same time having similar 
multi-resolution appearances at corresponding locations in A and B. As with texture trans- 
fer, appearance characteristics can include not only (blurred) color or luminance values but 
also orientations. 

This general framework allows image analogies to be applied to a variety of rendering 
tasks. In addition to exemplar-based non-photorealistic rendering, image analogies can be 
used for traditional texture synthesis, super-resolution, and texture transfer (using the same 
textured image for both A and A'). If only the filtered (rendered) image A' is available, as 
is the case with paintings, the missing reference image A can be hallucinated using a smart 
(edge preserving) blur operator. Finally, it is possible to train a system to perform texture-by- 
numbers by manually painting over a natural image with pseudocolors corresponding to pix- 
els’ semantic meanings, e.g., water, trees, and grass (Figure 10.55a-b). The resulting system 
can then convert a novel sketch into a fully rendered synthetic photograph (Figure 10.55c-d). 
In more recent work, Cheng, Vishwanathan, and Zhang (2008) add ideas from image quilting 
(Efros and Freeman 2001) and MRF inference (Komodakis, Tziritas, and Paragios 2008) to 
the basic image analogies algorithm, while Ramanarayanan and Bala (2007) recast this pro- 
cess as energy minimization, which means it can also be viewed as a conditional random field 
(Section 3.7.2), and devise an efficient algorithm to find a good minimum. 

Figure 10.55 Texture-by-numbers (Hertzmann, Jacobs, Oliver et al. 2001) © 2001 ACM. 
Given a textured image A' and a hand-labeled (painted) version A, synthesize a new image 
B' given just the painted version B. Panels, left to right: original A', painted A, novel 
painted B, novel textured B'. 

Figure 10.56 Non-photorealistic abstraction of photographs: (a) DeCarlo and Santella 
(2002) © 2002 ACM and (b) Farbman, Fattal, Lischinski et al. (2008) © 2008 ACM. 

More traditional filtering and feature detection techniques can also be used for non-photorealistic 
rendering.²³ For example, pen-and-ink illustration (Winkenbach and Salesin 
1994) and painterly rendering techniques (Litwinowicz 1997) use local color, intensity, and 
orientation estimates as an input to their procedural rendering algorithms. Techniques for 
stylizing and simplifying photographs and video (DeCarlo and Santella 2002; Winnemoller, 
Olsen, and Gooch 2006; Farbman, Fattal, Lischinski et al. 2008), as in Figure 10.56, use 
combinations of edge-preserving blurring (Section 3.3.1) and edge detection and enhancement 
(Section 4.2.3). 


²³ For a good selection of papers, see the Symposia on Non-Photorealistic Animation and Rendering (NPAR) at 
http://www.npar.org/. 

10.6 Additional reading 

A good overview of computational photography can be found in the book by Raskar and 
Tumblin (2010), survey articles by Nayar (2006), Cohen and Szeliski (2006), Levoy (2006), 
Debevec (2006), and Hayes (2008), as well as two special journal issues edited by Bimber 
(2006) and Durand and Szeliski (2007). Notes from the courses on computational photography 
mentioned at the beginning of this chapter are another great source of material and 
references.²⁴ 

The sub-field of high dynamic range imaging has its own book discussing research in this 
area (Reinhard, Ward, Pattanaik et al. 2005), as well as some books describing related pho- 
tographic techniques (Freeman 2008; Gulbins and Gulbins 2009). Algorithms for calibrating 
the radiometric response function of a camera can be found in articles by Mann and Picard 
(1995), Debevec and Malik (1997), and Mitsunaga and Nayar (1999). 

The subject of tone mapping is treated extensively in (Reinhard, Ward, Pattanaik et al. 
2005). Representative papers from the large volume of literature on this topic include those 
by Tumblin and Rushmeier (1993), Larson, Rushmeier, and Piatko (1997), Pattanaik, Ferw- 
erda, Fairchild et al. (1998), Tumblin and Turk (1999), Durand and Dorsey (2002), Fattal, 
Lischinski, and Werman (2002), Reinhard, Stark, Shirley et al. (2002), Lischinski, Farbman, 
Uyttendaele et al. (2006b), and Farbman, Fattal, Lischinski et al. (2008). 

The literature on super-resolution is quite extensive (Chaudhuri 2001; Park, Park, and 
Kang 2003; Capel and Zisserman 2003; Capel 2004; van Ouwerkerk 2006). The term super- 
resolution usually describes techniques for aligning and merging multiple images to produce 
higher-resolution composites (Keren, Peleg, and Brada 1988; Irani and Peleg 1991; Cheese- 
man, Kanefsky, Hanson et al. 1993; Mann and Picard 1994; Chiang and Boult 1996; Bascle, 
Blake, and Zisserman 1996; Capel and Zisserman 1998; Smelyanskiy, Cheeseman, Maluf et 
al. 2000; Capel and Zisserman 2000; Pickup, Capel, Roberts et al. 2009; Gulbins and Gul- 
bins 2009). However, single-image super-resolution techniques have also been developed 
(Freeman, Jones, and Pasztor 2002; Baker and Kanade 2002; Fattal 2007). 

A good survey on image matting is given by Wang and Cohen (2007a). Representative 
papers, which include extensive comparisons with previous work, include those by Chuang, 
Curless, Salesin et al. (2001), Wang and Cohen (2007b), Levin, Acha, and Lischinski (2008), 
Rhemann, Rother, Rav-Acha et al. (2008), and Rhemann, Rother, Wang et al. (2009). 

The literature on texture synthesis and hole filling includes traditional approaches to texture 
synthesis, which try to match image statistics between source and destination images 
(Heeger and Bergen 1995; De Bonet 1997; Portilla and Simoncelli 2000), as well as newer 
approaches, which search for matching neighborhoods or patches inside the source sample 
(Efros and Leung 1999; Wei and Levoy 2000; Efros and Freeman 2001). In a similar vein, 
traditional approaches to hole filling involve the solution of local variational (smooth continuation) 
problems (Bertalmio, Sapiro, Caselles et al. 2000; Bertalmio, Vese, Sapiro et al. 2003; 
Telea 2004). More recent techniques use data-driven texture synthesis approaches (Drori, 
Cohen-Or, and Yeshurun 2003; Kwatra, Schodl, Essa et al. 2003; Criminisi, Perez, and 
Toyama 2004; Sun, Yuan, Jia et al. 2004; Kwatra, Essa, Bobick et al. 2005; Wilczkowiak, 
Brostow, Tordoff et al. 2005; Komodakis and Tziritas 2007b; Wexler, Shechtman, and Irani 
2007). 

²⁴ MIT 6.815/6.865, http://stellar.mit.edu/S/course/6/sp08/6.815/materials.html, CMU 15-463, 
http://graphics.cs.cmu.edu/courses/15-463/2008_fall/, Stanford CS 448A, http://graphics.stanford.edu/courses/cs448a-08-spring/, and 
SIGGRAPH courses, http://web.media.mit.edu/~raskar/photo/. 

10.7 Exercises 

Ex 10.1: Radiometric calibration Implement one of the multi-exposure radiometric cali- 
bration algorithms described in Section 10.2 (Debevec and Malik 1997; Mitsunaga and Nayar 
1999; Reinhard, Ward, Pattanaik et al. 2005). This calibration will be useful in a number of 
different applications, such as stitching images or stereo matching with different exposures 
and shape from shading. 

1. Take a series of bracketed images with your camera on a tripod. If your camera has 
an automatic exposure bracketing (AEB) mode, taking three images may be sufficient 
to calibrate most of your camera’s dynamic range, especially if your scene has a lot of 
bright and dark regions. (Shooting outdoors or through a window on a sunny day is 
best.) 

2. If your images are not taken on a tripod, first perform a global alignment (similarity 
transform). 

3. Estimate the radiometric response function using one of the techniques cited above. 

4. Estimate the high dynamic range radiance image by selecting or blending pixels from 
different exposures (Debevec and Malik 1997; Mitsunaga and Nayar 1999; Eden, Uyt- 
tendaele, and Szeliski 2006). 

5. Repeat your calibration experiments under different conditions, e.g., indoors under in- 
candescent light, to get a sense for the range of color balancing effects that your camera 
imposes. 

6. If your camera supports RAW and JPEG mode, calibrate both sets of images simulta- 
neously and to each other (the radiance at each pixel will correspond). See if you can 
come up with a model for what your camera does, e.g., whether it treats color balance 
as a diagonal or full 3x3 matrix multiply, whether it uses non-linearities in addition 
to gamma, whether it sharpens the image while “developing” the JPEG image, etc. 

7. Develop an interactive viewer to change the exposure of an image based on the average 
exposure of a region around the mouse. (One variant is to show the adjusted image 
inside a window around the mouse. Another is to adjust the complete image based on
the mouse position.) 

8. Implement a tone mapping operator (Exercise 10.5) and use this to map your radiance 
image to a displayable gamut. 
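
As a starting point for step 4, the following is a minimal sketch of merging several exposures into a radiance estimate. It assumes the images have already been linearized (step 3) and scaled to [0, 1]; the function name `merge_exposures` and the hat-shaped weight are illustrative choices rather than the exact formulation of any of the cited papers.

```python
import numpy as np

def merge_exposures(images, exposure_times, eps=1e-6):
    """Weighted average of per-exposure radiance estimates.

    images: list of float arrays in [0, 1], assumed already linearized.
    exposure_times: shutter times in seconds, one per image.
    """
    num = np.zeros(images[0].shape, dtype=np.float64)
    den = np.zeros(images[0].shape, dtype=np.float64)
    for img, t in zip(images, exposure_times):
        # Hat weight: trust mid-range pixels, distrust near-black or saturated ones.
        w = 1.0 - np.abs(2.0 * img - 1.0)
        num += w * img / t   # this exposure's radiance estimate, weighted
        den += w
    return num / np.maximum(den, eps)

# Synthetic check: a known radiance ramp "photographed" at three shutter speeds.
radiance = np.linspace(0.05, 20.0, 50)
times = [1 / 30, 1 / 4, 1.0]
shots = [np.clip(radiance * t, 0.0, 1.0) for t in times]
hdr = merge_exposures(shots, times)
```

Saturated pixels receive zero weight, so each radiance value is recovered from whichever exposures captured it without clipping.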

Ex 10.2: Noise level function Determine your camera’s noise level function using either 
multiple shots or by analyzing smooth regions. 

1. Set up your camera on a tripod looking at a calibration target or a static scene with a
good variation in input levels and colors. (Check your camera’s histogram to ensure 
that all values are being sampled.) 

2. Take repeated images of the same scene (ideally with a remote shutter release) and 
average them to compute the variance at each pixel. Discarding pixels near high gra- 
dients (which are affected by camera motion), plot for each color channel the standard 
deviation at each pixel as a function of its output value. 

3. Fit a lower envelope to these measurements and use this as your noise level function. 
How much variation do you see in the noise as a function of input level? How much of 
this is significant, i.e., away from flat regions in your camera response function where 
you do not want to be sampling anyway? 

4. (Optional) Using the same images, develop a technique that segments the image into 
near-constant regions (Liu, Szeliski, Kang et al. 2008). (This is easier if you are pho- 
tographing a calibration chart.) Compute the deviations for each region from a single 
image and use them to estimate the NLF. How does this compare to the multi-image 
technique, and how stable are your estimates from image to image? 
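
Steps 2 and 3 can be sketched as follows for a single color channel. This is only an illustration: the name `noise_level_function`, the number of bins, and the use of a low percentile as a stand-in for the lower envelope are assumptions, and the rejection of pixels near high gradients is omitted for brevity.

```python
import numpy as np

def noise_level_function(stack, n_bins=16):
    """stack: (N, H, W) array of repeated shots of a static scene, one channel."""
    mean = stack.mean(axis=0)
    std = stack.std(axis=0, ddof=1)          # per-pixel sample standard deviation
    edges = np.linspace(mean.min(), mean.max(), n_bins + 1)
    idx = np.clip(np.digitize(mean.ravel(), edges) - 1, 0, n_bins - 1)
    centers, envelope = [], []
    for b in range(n_bins):
        s = std.ravel()[idx == b]
        if s.size:
            centers.append(0.5 * (edges[b] + edges[b + 1]))
            # A low percentile approximates the lower envelope of the scatter.
            envelope.append(np.percentile(s, 10))
    return np.array(centers), np.array(envelope)

# Synthetic stack whose noise grows with brightness.
rng = np.random.default_rng(0)
scene = np.tile(np.linspace(0.1, 0.9, 64), (64, 1))
sigma = 0.01 + 0.04 * scene
stack = scene + sigma * rng.standard_normal((30, 64, 64))
levels, nlf = noise_level_function(stack)
```

Because a low percentile is used, the estimate is biased slightly below the true standard deviation; with few repeated shots this bias can be noticeable.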

Ex 10.3: Vignetting Estimate the amount of vignetting in some of your lenses using one of 
the following three techniques (or devise one of your choosing): 

1. Take an image of a large uniform intensity region (well-illuminated wall or blue sky — 
but be careful of brightness gradients) and fit a radial polynomial curve to estimate the 
vignetting. 

2. Construct a center-weighted panorama and compare these pixel values to the input im- 
age values to estimate the vignetting function. Weight pixels in slowly varying regions 
more highly, as small misalignments will give large errors at high gradients. Option- 
ally estimate the radiometric response function as well (Litvinov and Schechner 2005; 
Goldman 2011). 

3. Analyze the radial gradients (especially in low-gradient regions) and fit the robust
means of these gradients to the derivative of the vignetting function, as described by 
Zheng, Yu, Kang et al. (2008). 

For the parametric form of your vignetting function, you can either use a simple radial func- 
tion, e.g., 

    f(r) = 1 + \alpha_1 r + \alpha_2 r^2 + \cdots    (10.42)

or one of the specialized equations developed by Kang and Weiss (2000) and Zheng, Lin, and 
Kang (2006). 

In all of these cases, be sure that you are using linearized intensity measurements, by 
using either RAW images or images linearized through a radiometric response function, or at 
least images where the gamma curve has been removed. 

(Optional) What happens if you forget to undo the gamma before fitting a (multiplicative) 
vignetting function? 
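
For the first technique, a low-order radial polynomial f(r) = 1 + α_1 r + α_2 r^2 can be fitted with linear least squares, since an image of a uniform region of unknown brightness L satisfies I = L f(r). The sketch below is one possible way to do this; it assumes a linearized single-channel image, places the optical center at the image center, and normalizes r to [0, 1], all of which are simplifying assumptions (the name `fit_vignetting` is hypothetical).

```python
import numpy as np

def fit_vignetting(image, degree=2):
    """Fit I = c0 + c1 r + ... and return alpha_i = c_i / c0 for i >= 1."""
    h, w = image.shape
    y, x = np.mgrid[0:h, 0:w]
    r = np.hypot(x - (w - 1) / 2.0, y - (h - 1) / 2.0)
    r /= r.max()                                            # normalized radius
    A = np.vander(r.ravel(), degree + 1, increasing=True)   # columns 1, r, r^2, ...
    c, *_ = np.linalg.lstsq(A, image.ravel(), rcond=None)
    return c[1:] / c[0]

# Synthetic linear image of a uniform wall with known falloff.
h, w = 61, 81
yy, xx = np.mgrid[0:h, 0:w]
rr = np.hypot(xx - (w - 1) / 2.0, yy - (h - 1) / 2.0)
rr /= rr.max()
flat = 0.8 * (1.0 - 0.1 * rr - 0.3 * rr ** 2)
alphas = fit_vignetting(flat)
```

Running the same fit on a gamma-encoded copy of `flat` is a quick way to explore the optional question above.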

Ex 10.4: Optical blur (PSF) estimation Compute the optical PSF either using a known 
target (Figure 10.7) or by detecting and fitting step edges (Section 10.1.4) (Joshi, Szeliski, 
and Kriegman 2008). 

1. Detect strong edges to sub-pixel precision.

2. Fit a local profile to each oriented edge and fill these pixels into an ideal target image, 
either at image resolution or at a higher resolution (Figure 10.9c-d). 

3. Use least squares (10.1) at valid pixels to estimate the PSF kernel K, either globally or
in locally overlapping sub-regions of the image. 

4. Visualize the recovered PSFs and use them to remove chromatic aberration or de-blur 
the image. 
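
The least squares step (step 3) can be sketched as below, assuming an ideal (sharp) target image has already been predicted at image resolution and that every interior pixel is valid. The function name `estimate_psf` is hypothetical, and the kernel is written as a correlation; flip it in both directions to obtain the equivalent convolution kernel.

```python
import numpy as np

def estimate_psf(sharp, blurred, ksize=5):
    """Solve blurred ~ sharp correlated with K for a ksize x ksize kernel K."""
    r = ksize // 2
    h, w = sharp.shape
    cols = []
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            # Shifted copy of the sharp image, restricted to interior pixels.
            cols.append(sharp[r + dy:h - r + dy, r + dx:w - r + dx].ravel())
    A = np.stack(cols, axis=1)              # one column per kernel tap
    b = blurred[r:h - r, r:w - r].ravel()
    k, *_ = np.linalg.lstsq(A, b, rcond=None)
    return k.reshape(ksize, ksize)

# Synthetic check: blur a random image with a known binomial kernel.
rng = np.random.default_rng(1)
sharp = rng.random((40, 40))
true_k = np.outer([1, 4, 6, 4, 1], [1, 4, 6, 4, 1]) / 256.0
blurred = np.zeros_like(sharp)
for dy in range(-2, 3):
    for dx in range(-2, 3):
        blurred[2:-2, 2:-2] += true_k[dy + 2, dx + 2] * sharp[2 + dy:38 + dy, 2 + dx:38 + dx]
psf = estimate_psf(sharp, blurred)
```

With real data, restricting the rows of the system to pixels near detected edges and adding a smoothness or non-negativity prior on K would likely be needed.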

Ex 10.5: Tone mapping Implement one of the tone mapping algorithms discussed in Sec- 
tion 10.2.1 (Durand and Dorsey 2002; Fattal, Lischinski, and Werman 2002; Reinhard, Stark, 
Shirley et al. 2002; Lischinski, Farbman, Uyttendaele et al. 2006b) or any of the numerous
additional algorithms discussed by Reinhard, Ward, Pattanaik et al. (2005) and
http://stellar.mit.edu/S/course/6/sp08/6.815/materials.html.

(Optional) Compare your algorithm to local histogram equalization (Section 3.1.4). 
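
As a baseline to compare against, here is a minimal sketch of a global operator in the spirit of the one described by Reinhard, Stark, Shirley et al. (2002): luminance is scaled relative to its log-average and then compressed with L / (1 + L). The key value 0.18 and the function name `tone_map_global` are assumptions made for illustration.

```python
import numpy as np

def tone_map_global(luminance, key=0.18, delta=1e-6):
    """Map positive scene luminance to display values in [0, 1)."""
    log_avg = np.exp(np.mean(np.log(delta + luminance)))   # log-average luminance
    scaled = key * luminance / log_avg
    return scaled / (1.0 + scaled)

lum = np.array([0.01, 1.0, 100.0, 1e4])   # six orders of magnitude
display = tone_map_global(lum)
```

For color images, one common choice is to scale each RGB value by the ratio of output to input luminance; local operators such as those cited above generally preserve more detail.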

Ex 10.6: Flash enhancement Develop an algorithm to combine flash and non-flash pho- 
tographs to best effect. You can use ideas from Eisemann and Durand (2004) and Petschnigg, 
Agrawala, Hoppe et al. (2004) or anything else you think might work well. 

Ex 10.7: Super-resolution Implement one or more super-resolution algorithms and compare
their performance.

1. Take a set of photographs of the same scene using a hand-held camera (to ensure that
there is some jitter between the photographs). 

2. Determine the PSF for the images you are trying to super-resolve using one of the 
techniques in Exercise 10.4. 

3. Alternatively, simulate a collection of lower-resolution images by taking a high-quality 
photograph (avoid those with compression artifacts) and applying your own pre-filter 
kernel and downsampling. 

4. Estimate the relative motion between the images using a parametric translation and 
rotation motion estimation algorithm (Sections 6.1.3 or 8.2). 

5. Implement a basic least squares super-resolution algorithm by minimizing the differ- 
ence between the observed and downsampled images (10.27-10.28). 

6. Add in a gradient image prior, either as another least squares term or as a robust term 
that can be minimized using iteratively reweighted least squares (Appendix A.3).

7. (Optional) Implement one of the example-based super-resolution techniques, where 
matching against a set of exemplar images is used either to infer higher-frequency 
information to be added to the reconstruction (Freeman, Jones, and Pasztor 2002) 
or higher-frequency gradients to be matched in the super-resolved image (Baker and 
Kanade 2002). 

8. (Optional) Use local edge statistic information to improve the quality of the super- 
resolved image (Fattal 2007). 

Ex 10.8: Image matting Develop an algorithm for pulling a foreground matte from natural 
images, as described in Section 10.4. 

1. Make sure that the images you are taking are linearized (Exercise 10.1 and Section 10.1)
and that your camera exposure is fixed (full manual mode), at least when taking multi- 
ple shots of the same scene. 

2. To acquire ground truth data, place your object in front of a computer monitor and 
display a variety of solid background colors as well as some natural imagery. 

3. Remove your object and re-display the same images to acquire known background
colors. 

4. Use triangulation matting (Smith and Blinn 1996) to estimate the ground truth opacities
α and pre-multiplied foreground colors αF for your objects.

5. Implement one or more of the natural image matting algorithms described in Sec- 
tion 10.4 and compare your results to the ground truth values you computed. Alter- 
natively, use the matting test images published on http://alphamatting.com/. 

6. (Optional) Run your algorithms on other images taken with the same calibrated camera 
(or other images you find interesting). 
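
The ground truth computation in step 4 can be sketched for the two-background case as follows. Each composite satisfies C_i = αF + (1 − α) B_i, so subtracting the two composites isolates (1 − α)(B_1 − B_2). The function name `triangulation_matte` is hypothetical, and the sketch assumes linear RGB images and backgrounds that differ at every pixel.

```python
import numpy as np

def triangulation_matte(c1, c2, b1, b2, eps=1e-8):
    """c1, c2: composites over known backgrounds b1, b2, all (H, W, 3) linear RGB."""
    db = b1 - b2
    # Project the composite difference onto the background difference.
    one_minus_alpha = np.sum((c1 - c2) * db, axis=-1) / np.maximum(np.sum(db * db, axis=-1), eps)
    alpha = np.clip(1.0 - one_minus_alpha, 0.0, 1.0)
    premult_f = c1 - (1.0 - alpha)[..., None] * b1     # alpha * F
    return alpha, premult_f

# Synthetic check with a blue and a green background.
rng = np.random.default_rng(2)
alpha_true = rng.random((4, 4))
fg = rng.random((4, 4, 3))
b1 = np.broadcast_to(np.array([0.0, 0.0, 1.0]), (4, 4, 3))
b2 = np.broadcast_to(np.array([0.0, 1.0, 0.0]), (4, 4, 3))
c1 = alpha_true[..., None] * fg + (1.0 - alpha_true[..., None]) * b1
c2 = alpha_true[..., None] * fg + (1.0 - alpha_true[..., None]) * b2
alpha, premult = triangulation_matte(c1, c2, b1, b2)
```

With more than two backgrounds, the same equations can be stacked and solved in a least squares sense, which should reduce the effect of noise.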

Ex 10.9: Smoke and shadow matting Extract smoke or shadow mattes from one scene 
and insert them into another (Chuang, Agarwala, Curless et al. 2002; Chuang, Goldman, 
Curless et al. 2003). 

1. Take a still or video sequence of images with and without some intermittent smoke and
shadows. (Remember to linearize your images before proceeding with any computa- 
tions.) 

2. For each pixel, fit a line to the observed color values. 

3. If performing smoke matting, robustly compute the intersection of these lines to obtain 
the smoke color estimate. Then, estimate the background color as the other extremum 
(unless you already took a smoke-free background image). 

If performing shadow matting, compute robust shadow (minimum) and lit (maximum) 
values for each pixel. 

4. Extract the smoke or shadow mattes from each frame as the fraction between these two 
values (background and smoke or shadowed and lit). 

5. Scan a new (destination) scene or modify the original background with an image editor. 

6. Re-insert the smoke or shadow matte, along with any other foreground objects you may 
have extracted. 

7. (Optional) Using a series of cast stick shadows, estimate the deformation field for the 
destination scene in order to correctly warp (drape) the shadows across the new ge- 
ometry. (This is related to the shadow scanning technique developed by Bouguet and 
Perona (1999) and implemented in Exercise 12.2.) 

8. (Optional) Chuang, Goldman, Curless et al. (2003) only demonstrated their technique 
for planar source geometries. Can you extend their technique to capture shadows ac- 
quired from an irregular source geometry? 

9. (Optional) Can you change the direction of the shadow, i.e., simulate the effect of
changing the light source direction? 
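
For the shadow case, steps 3, 4, and 6 can be sketched as below. This is a simplification that follows the exercise steps rather than the full method of the cited papers: percentiles stand in for the robust minimum and maximum, a single channel is used, and the names `shadow_mattes` and `composite_shadow` are hypothetical.

```python
import numpy as np

def shadow_mattes(frames, lo=5, hi=95, eps=1e-6):
    """frames: (T, H, W) linear intensities of a static scene with a moving shadow."""
    shadowed = np.percentile(frames, lo, axis=0)   # robust per-pixel minimum
    lit = np.percentile(frames, hi, axis=0)        # robust per-pixel maximum
    # Fraction of the way from lit (0) to fully shadowed (1) in each frame.
    beta = (lit - frames) / np.maximum(lit - shadowed, eps)
    return np.clip(beta, 0.0, 1.0), shadowed, lit

def composite_shadow(beta, new_lit, new_shadowed):
    """Re-insert a shadow matte over a destination scene."""
    return beta * new_shadowed + (1.0 - beta) * new_lit

# Synthetic sequence: the shadow covers one column for five frames at a time.
frames = np.full((20, 2, 4), 0.8)
for t in range(20):
    frames[t, :, t // 5] = 0.3
beta, shadowed, lit = shadow_mattes(frames)
```

The percentiles only behave like a minimum and maximum if each pixel is observed both lit and shadowed in a reasonable fraction of the frames.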

Ex 10.10: Texture synthesis Implement one of the texture synthesis or hole filling algo- 
rithms presented in Section 10.5. Here is one possible procedure: 

1. Implement the basic Efros and Leung (1999) algorithm, i.e., starting from the outside 
(for hole filling) or in raster order (for texture synthesis), search for a similar neighbor- 
hood in the source texture image, and copy that pixel. 

2. Add in the Wei and Levoy (2000) extension of generating the pixels in a coarse-to-fine 
fashion, i.e., generate a lower-resolution synthetic texture (or filled image), and use this 
as a guide for matching regions in the finer resolution version. 

3. Add in the Criminisi, Perez, and Toyama (2004) idea of prioritizing pixels to be filled 
by some function of the local structure (gradient or orientation strength). 

4. Extend any of the above algorithms by selecting sub-blocks in the source texture and 
using optimization to determine the seam between the new block and the existing image 
that it overlaps (Efros and Freeman 2001). 

5. (Optional) Implement one of the isophote (smooth continuation) inpainting algorithms 
(Bertalmio, Sapiro, Caselles et al. 2000; Telea 2004).

6. (Optional) Add the ability to supply a target (reference) image (Efros and Freeman 
2001) or to provide sample filtered or unfiltered (reference and rendered) images
(Hertzmann, Jacobs, Oliver et al. 2001), see Section 10.5.2.

Ex 10.11: Colorization Implement the Levin, Lischinski, and Weiss (2004) colorization al- 
gorithm that is sketched out in Section 10.3.2 and Figure 10.37. Find some historic monochrome 
photographs and some modern color ones. Write an interactive tool that lets you “pick” col- 
ors from a modern photo and paint over the old one. Tune the algorithm parameters to give 
you good results. Are you pleased with the results? Can you think of ways to make them 
look more “antique”, e.g., with softer (less saturated and edgy) colors? 

Figure 11.1 Stereo reconstruction techniques can convert (a-b) a pair of images into (c)
a depth map (http://vision.middlebury.edu/stereo/data/scenes2003/) or (d-e) a sequence of 
images into (f) a 3D model (http://vision.middlebury.edu/mview/data/). (g) An analytical 
stereo plotter, courtesy of Kenney Aerial Mapping, Inc., can generate (h) contour plots. 

Stereo matching is the process of taking two or more images and estimating a 3D model of
the scene by finding matching pixels in the images and converting their 2D positions into 
3D depths. In Chapters 6-7, we described techniques for recovering camera positions and 
building sparse 3D models of scenes or objects. In this chapter, we address the question 
of how to build a more complete 3D model, e.g., a sparse or dense depth map that assigns 
relative depths to pixels in the input images. We also look at the topic of multi-view stereo 
algorithms that produce complete 3D volumetric or surface-based object models. 

Why are people interested in stereo matching? From the earliest inquiries into visual per- 
ception, it was known that we perceive depth based on the differences in appearance between 
the left and right eye. 1 As a simple experiment, hold your finger vertically in front of your 
eyes and close each eye alternately. You will notice that the finger jumps left and right relative 
to the background of the scene. The same phenomenon is visible in the image pair shown in 
Figure 11.1a-b, in which the foreground objects shift left and right relative to the background.

As we will shortly see, under simple imaging configurations (both eyes or cameras look- 
ing straight ahead), the amount of horizontal motion or disparity is inversely proportional to 
the distance from the observer. While the basic physics and geometry relating visual disparity 
to scene structure are well understood (Section 11.1), automatically measuring this disparity 
by establishing dense and accurate inter-image correspondences is a challenging task. 

The earliest stereo matching algorithms were developed in the field of photogrammetry 
for automatically constructing topographic elevation maps from overlapping aerial images. 
Prior to this, operators would use photogrammetric stereo plotters, which displayed shifted 
versions of such images to each eye and allowed the operator to float a dot cursor around
constant elevation contours (Figure 11.1g). The development of fully automated stereo matching
algorithms was a major advance in this field, enabling much more rapid and less expensive 
processing of aerial imagery (Hannah 1974; Hsieh, McKeown, and Perlant 1992). 

In computer vision, the topic of stereo matching has been one of the most widely stud- 
ied and fundamental problems (Marr and Poggio 1976; Barnard and Fischler 1982; Dhond 
and Aggarwal 1989; Scharstein and Szeliski 2002; Brown, Burschka, and Hager 2003; Seitz, 
Curless, Diebel et al. 2006), and continues to be one of the most active research areas. While 
photogrammetric matching concentrated mainly on aerial imagery, computer vision
applications include modeling the human visual system (Marr 1982), robotic navigation and
manipulation (Moravec 1983; Konolige 1997; Thrun, Montemerlo, Dahlkamp et al. 2006), as well
as view interpolation and image-based rendering (Figure 11.2a-d), 3D model building
(Figure 11.2e-f and h-j), and mixing live action with computer-generated imagery (Figure 11.2g).

1 The word stereo comes from the Greek for solid; stereo vision is how we perceive solid shape (Koenderink
1990).

Figure 11.2 Applications of stereo vision: (a) input image, (b) computed depth map, and (c)
new view generation from multi-view stereo (Matthies, Kanade, and Szeliski 1989) © 1989
Springer; (d) view morphing between two images (Seitz and Dyer 1996) © 1996 ACM; (e-f)
3D face modeling (images courtesy of Frederic Devernay); (g) z-keying live and computer-
generated imagery (Kanade, Yoshida, Oda et al. 1996) © 1996 IEEE; (h-j) building 3D
surface models from multiple video streams in Virtualized Reality (Kanade, Rander, and
Narayanan 1997).

In this chapter, we describe the fundamental principles behind stereo matching, following
the general taxonomy proposed by Scharstein and Szeliski (2002). We begin in Section 11.1
with a review of the geometry of stereo image matching, i.e., how to compute for a given
pixel in one image the range of possible locations the pixel might appear at in the other
image, i.e., its epipolar line. We describe how to pre-warp images so that corresponding
epipolar lines are coincident (rectification). We also describe a general resampling algorithm
called plane sweep that can be used to perform multi-image stereo matching with arbitrary
camera configurations.

Next, we briefly survey techniques for the sparse stereo matching of interest points and 
edge-like features (Section 11.2). We then turn to the main topic of this chapter, namely the 
estimation of a dense set of pixel-wise correspondences in the form of a disparity map (Fig- 
ure 11.1c). This involves first selecting a pixel matching criterion (Section 11.3) and then 
using either local area-based aggregation (Section 11.4) or global optimization (Section 11.5)
to help disambiguate potential matches. In Section 11.6, we discuss multi-view stereo
methods that aim to reconstruct a complete 3D model instead of just a single disparity image
(Figure 11.1d-f).


11.1 Epipolar geometry 

Given a pixel in one image, how can we compute its correspondence in the other image? In 
Chapter 8, we saw that a variety of search techniques can be used to match pixels based on 
their local appearance as well as the motions of neighboring pixels. In the case of stereo 
matching, however, we have some additional information available, namely the positions and 
calibration data for the cameras that took the pictures of the same static scene (Section 7.2). 

How can we exploit this information to reduce the number of potential correspondences, 
and hence both speed up the matching and increase its reliability? Figure 11.3a shows how a
pixel in one image x_0 projects to an epipolar line segment in the other image. The segment
is bounded at one end by the projection of the original viewing ray at infinity p_∞ and at the
other end by the projection of the original camera center c_0 into the second camera, which
is known as the epipole e_1. If we project the epipolar line in the second image back into the
first, we get another line (segment), this time bounded by the other corresponding epipole
e_0. Extending both line segments to infinity, we get a pair of corresponding epipolar lines
(Figure 11.3b), which are the intersection of the two image planes with the epipolar plane
that passes through both camera centers c_0 and c_1 as well as the point of interest p (Faugeras
and Luong 2001; Hartley and Zisserman 2004).

Figure 11.3 Epipolar geometry: (a) epipolar line segment corresponding to one ray; (b)
corresponding set of epipolar lines and their epipolar plane. 

11.1.1 Rectification 

As we saw in Section 7.2, the epipolar geometry for a pair of cameras is implicit in the 
relative pose and calibrations of the cameras, and can easily be computed from seven or more 
point matches using the fundamental matrix (or five or more points for the calibrated essential 
matrix) (Zhang 1998a,b; Faugeras and Luong 2001; Hartley and Zisserman 2004). Once this 
geometry has been computed, we can use the epipolar line corresponding to a pixel in one 
image to constrain the search for corresponding pixels in the other image. One way to do this 
is to use a general correspondence algorithm, such as optical flow (Section 8.4), but to only 
consider locations along the epipolar line (or to project any flow vectors that fall off back onto 
the line). 
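
As a small illustration of this constraint, the sketch below computes the epipolar line in the second image for a pixel in the first and measures how far a candidate match lies from it. It assumes the convention x'^T F x = 0 for the fundamental matrix F; the function names are hypothetical, and the example matrix corresponds to a purely horizontal camera translation.

```python
import numpy as np

def epipolar_line(F, x):
    """Return (a, b, c) with a*u + b*v + c = 0 in the second image, a^2 + b^2 = 1."""
    l = F @ np.array([x[0], x[1], 1.0])
    return l / np.hypot(l[0], l[1])   # undefined if x maps to the epipole itself

def distance_to_line(l, x):
    """Perpendicular distance (in pixels) from point x to the normalized line l."""
    return abs(l[0] * x[0] + l[1] * x[1] + l[2])

# Fundamental matrix for a sideways translation: epipolar lines are scanlines.
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
line = epipolar_line(F, (10.0, 20.0))
on_line = distance_to_line(line, (35.0, 20.0))    # same scanline
off_line = distance_to_line(line, (35.0, 23.0))   # three rows away
```

A matcher can discard candidates whose distance exceeds a small threshold, or project them back onto the line as described above.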

A more efficient algorithm can be obtained by first rectifying (i.e., warping) the input
images so that corresponding horizontal scanlines are epipolar lines (Loop and Zhang 1999; 
Faugeras and Luong 2001; Hartley and Zisserman 2004). 2 Afterwards, it is possible to match 
horizontal scanlines independently or to shift images horizontally while computing matching 
scores (Figure 11.4). 

2 This makes most sense if the cameras are next to each other, although by rotating the cameras, rectification can
be performed on any pair that is not verged too much and does not have too much of a scale change. In those latter
cases, using plane sweep (below) or hypothesizing small planar patch locations in 3D (Goesele, Snavely, Curless et al.
2007) may be preferable.

Figure 11.4 The multi-stage stereo rectification algorithm of Loop and Zhang (1999) ©
1999 IEEE: (a) original image pair overlaid with several epipolar lines; (b) images
transformed so that epipolar lines are parallel; (c) images rectified so that epipolar lines are
horizontal and in vertical correspondence; (d) final rectification that minimizes horizontal
distortions.

A simple way to rectify the two images is to first rotate both cameras so that they are
looking perpendicular to the line joining the camera centers c_0 and c_1. Since there is a
degree of freedom in the tilt, the smallest rotations that achieve this should be used. Next, to
determine the desired twist around the optical axes, make the up vector (the camera y axis)
perpendicular to the camera center line. This ensures that corresponding epipolar lines are
horizontal and that the disparity for points at infinity is 0. Finally, re-scale the images, if
necessary, to account for different focal lengths, magnifying the smaller image to avoid aliasing.
(The full details of this procedure can be found in Fusiello, Trucco, and Verri (2000) and
Exercise 11.1.) Note that in general, it is not possible to rectify an arbitrary collection of images
simultaneously unless their optical centers are collinear, although rotating the cameras so that
they all point in the same direction reduces the inter-camera pixel movements to scalings and
translations. 
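
The rotation step of this procedure can be sketched as follows. The construction shown (new x axis along the baseline, new y axis perpendicular to the baseline and to an old optical axis) is one common way of choosing the shared orientation and is an assumption here, not necessarily the smallest-rotation solution described above; `rectifying_rotation` and its arguments are hypothetical names.

```python
import numpy as np

def rectifying_rotation(c0, c1, old_z):
    """Rows are the new camera x, y, z axes expressed in world coordinates."""
    x = (c1 - c0) / np.linalg.norm(c1 - c0)   # along the line joining the centers
    y = np.cross(old_z, x)
    y = y / np.linalg.norm(y)                 # up vector, perpendicular to the baseline
    z = np.cross(x, y)                        # new optical axis, perpendicular to the baseline
    return np.vstack([x, y, z])

c0 = np.array([0.0, 0.0, 0.0])
c1 = np.array([1.0, 0.0, 0.2])                # second camera slightly ahead of the first
old_z = np.array([0.0, 0.0, 1.0])             # e.g., the first camera's optical axis
R = rectifying_rotation(c0, c1, old_z)
```

Both cameras are given this same orientation; each image is then warped by the homography induced by the rotation from its old orientation to the new one.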

Figure 11.5 Slices through a typical disparity space image (DSI) (Scharstein and Szeliski
2002) © 2002 Springer: (a) original color image; (b) ground truth disparities; (c-e) three
(x, y) slices for d = 10, 16, 21; (f) an (x, d) slice for y = 151 (the dashed line in (b)).
Various dark (matching) regions are visible in (c-e), e.g., the bookshelves, table and cans,
and head statue, and three disparity levels can be seen as horizontal lines in (f). The dark
bands in the DSIs indicate regions that match at this disparity. (Smaller dark regions are often
the result of textureless regions.) Additional examples of DSIs are discussed by Bobick and
Intille (1999).

The resulting standard rectified geometry is employed in many stereo camera setups and
stereo algorithms, and leads to a very simple inverse relationship between 3D depths Z and
disparities d,

    d = f \frac{B}{Z},    (11.1)

where f is the focal length (measured in pixels), B is the baseline, and

    x' = x + d(x, y), \quad y' = y    (11.2)

describes the relationship between corresponding pixel coordinates in the left and right
images (Bolles, Baker, and Marimont 1987; Okutomi and Kanade 1993; Scharstein and Szeliski
2002). 3 The task of extracting depth from a set of images then becomes one of estimating the
disparity map d(x, y). 
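
As a worked example of the inverse relationship between depth and disparity, Z = f B / d, the sketch below converts a disparity in pixels into a depth in the units of the baseline. The camera parameters (a 700-pixel focal length and a 12 cm baseline) are made-up values chosen only for illustration.

```python
def depth_from_disparity(d, focal_px, baseline):
    """Invert d = f B / Z; zero disparity corresponds to a point at infinity."""
    if d <= 0:
        return float("inf")
    return focal_px * baseline / d

z_near = depth_from_disparity(42.0, focal_px=700.0, baseline=0.12)   # about 2 m
z_far = depth_from_disparity(1.0, focal_px=700.0, baseline=0.12)     # about 84 m
```

Because depth varies as 1/d, a one-pixel disparity error matters far more for distant points than for nearby ones.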

After rectification, we can easily compare the similarity of pixels at corresponding
locations (x, y) and (x', y') = (x + d, y) and store them in a disparity space image (DSI)
C(x, y, d) for further processing (Figure 11.5). The concept of the disparity space (x, y, d)
dates back to early work in stereo matching (Marr and Poggio 1976), while the concept of a 
disparity space image (volume) is generally associated with Yang, Yuille, and Lu (1993) and 
Intille and Bobick (1994). 
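
A minimal sketch of building such a DSI for a rectified grayscale pair is shown below. It uses squared intensity differences as the similarity measure and follows the convention x' = x + d used above; the function name and the choice to mark out-of-image comparisons with infinity are assumptions made for illustration.

```python
import numpy as np

def disparity_space_image(ref, other, max_disp):
    """C[d, y, x] = (ref[y, x] - other[y, x + d])^2, inf where x + d leaves the image."""
    h, w = ref.shape
    dsi = np.full((max_disp + 1, h, w), np.inf)
    for d in range(max_disp + 1):
        diff = ref[:, :w - d] - other[:, d:]   # compare column x with column x + d
        dsi[d, :, :w - d] = diff ** 2
    return dsi

# Synthetic pair in which every pixel moves three columns to the right.
rng = np.random.default_rng(3)
ref = rng.random((8, 32))
other = np.roll(ref, 3, axis=1)
dsi = disparity_space_image(ref, other, max_disp=5)
```

In this synthetic example the d = 3 slice is zero wherever the comparison is valid, which is the analogue of the dark matching bands in Figure 11.5.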


11.1.2 Plane sweep 

An alternative to pre-rectifying the images before matching is to sweep a set of planes through 
the scene and to measure the photoconsistency of different images as they are re-projected 
onto these planes (Figure 11.6). This process is commonly known as the plane sweep algo- 
rithm (Collins 1996; Szeliski and Golland 1999; Saito and Kanade 1999). 

3 The term disparity was first introduced in the human vision literature to describe the difference in location
of corresponding features seen by the left and right eyes (Marr 1982). Horizontal disparity is the most commonly
studied phenomenon, but vertical disparity is possible if the eyes are verged.

Figure 11.6 Sweeping a set of planes through a scene (Szeliski and Golland 1999) © 1999
Springer: (a) The set of planes seen from a virtual camera induces a set of homographies in
any other source (input) camera image. (b) The warped images from all the other cameras can
be stacked into a generalized disparity space volume I(x, y, d, k) indexed by pixel location
(x, y), disparity d, and camera k.

As we saw in Section 2.1.5, where we introduced projective depth (also known as plane
plus parallax (Kumar, Anandan, and Hanna 1994; Sawhney 1994; Szeliski and Coughlan
1997)), the last row of a full-rank 4x4 projection matrix P can be set to an arbitrary plane
equation p_3 = s_3 [n̂_0 | c_0]. The resulting four-dimensional projective transform (collineation)
(2.68) maps 3D world points p = (X, Y, Z, 1) into screen coordinates x_s = (x_s, y_s, 1, d),
where the projective depth (or parallax) d (2.66) is 0 on the reference plane (Figure 2.11).

Sweeping d through a series of disparity hypotheses, as shown in Figure 11.6a, corresponds
to mapping each input image into the virtual camera P defining the disparity space
through a series of homographies (2.68-2.71),

    x_k \sim P_k P^{-1} x_s = H_k x + t_k d = (H_k + t_k [0 \; 0 \; d]) x,    (11.3)

as shown in Figure 2.12b, where x_k and x are the homogeneous pixel coordinates in the
source and virtual (reference) images (Szeliski and Golland 1999). The members of the
family of homographies H_k(d) = H_k + t_k [0 0 d], which are parameterized by the addition of
a rank-1 matrix, are related to each other through a planar homology (Hartley and Zisserman
2004, A5.2). 
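
The family H_k(d) = H_k + t_k [0 0 d] is simple to construct in code. The sketch below builds the homography for one disparity hypothesis and applies it to a pixel; the function names are hypothetical, and the example values (identity H_k and t_k along the x axis) are chosen so that the warp reduces to the horizontal shift of the rectified case.

```python
import numpy as np

def sweep_homography(H_k, t_k, d):
    """H_k(d) = H_k + t_k [0 0 d]: H_k plus a rank-1 update scaled by d."""
    return H_k + np.outer(t_k, [0.0, 0.0, d])

def warp_point(H, x, y):
    """Apply a homography to pixel (x, y) and de-homogenize."""
    p = H @ np.array([x, y, 1.0])
    return p[:2] / p[2]

H_k = np.eye(3)
t_k = np.array([1.0, 0.0, 0.0])
H_5 = sweep_homography(H_k, t_k, 5.0)
shifted = warp_point(H_5, 10.0, 20.0)     # moves 5 pixels along x
```

In a full plane sweep, each source image would be resampled through H_k(d) for every d in the chosen disparity range and stacked for comparison.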

The choice of virtual camera and parameterization is application dependent and is what 
gives this framework a lot of its flexibility. In many applications, one of the input cameras 
(the reference camera) is used, thus computing a depth map that is registered with one of the 
input images and which can later be used for image-based rendering (Sections 13.1 and 13.2). 
In other applications, such as view interpolation for gaze correction in video-conferencing 
(Section 11.4.2) (Ott, Lewis, and Cox 1993; Criminisi, Shotton, Blake et al. 2003), a camera
centrally located between the two input cameras is preferable, since it provides the needed 
per-pixel disparities to hallucinate the virtual middle image. 

The choice of disparity sampling, i.e., the setting of the zero parallax plane and the scaling 
of integer disparities, is also application dependent, and is usually set to bracket the range of 
interest, i.e., the working volume, while scaling disparities to sample the image in pixel (or 
sub-pixel) shifts. For example, when using stereo vision for obstacle avoidance in robot 
navigation, it is most convenient to set up disparity to measure per-pixel elevation above the 
ground (Ivanchenko, Shen, and Coughlan 2009). 

As each input image is warped onto the current planes parameterized by disparity d, it
can be stacked into a generalized disparity space image I(x, y, d, k) for further processing
(Figure 11.6b) (Szeliski and Golland 1999). In most stereo algorithms, the photoconsistency
(e.g., sum of squared or robust differences) with respect to the reference image I_r is calculated
and stored in the DSI

    C(x, y, d) = \sum_k \rho(I(x, y, d, k) - I_r(x, y)).    (11.4)

However, it is also possible to compute alternative statistics such as robust variance, focus, 
or entropy (Section 11.3.1) (Vaish, Szeliski, Zitnick et al. 2006) or to use this representation
to reason about occlusions (Szeliski and Golland 1999; Kang and Szeliski 2004). The gen- 
eralized DSI will come in particularly handy when we come back to the topic of multi-view 
stereo in Section 11.6. 

Of course, planes are not the only surfaces that can be used to define a 3D sweep through 
the space of interest. Cylindrical surfaces, especially when coupled with panoramic photog- 
raphy (Chapter 9), are often used (Ishiguro, Yamamoto, and Tsuji 1992; Kang and Szeliski 
1997; Shum and Szeliski 1999; Li, Shum, Tang et al. 2004; Zheng, Kang, Cohen et al. 2007). 
It is also possible to define other manifold topologies, e.g., ones where the camera rotates 
around a fixed axis (Seitz 2001). 

Once the DSI has been computed, the next step in most stereo correspondence algorithms 
is to produce a univalued function in disparity space d(x, y) that best describes the shape of 
the surfaces in the scene. This can be viewed as finding a surface embedded in the disparity 
space image that has some optimality property, such as lowest cost and best (piecewise) 
smoothness (Yang, Yuille, and Lu 1993). Figure 11.5 shows examples of slices through a 
typical DSI. More figures of this kind can be found in the paper by Bobick and Intille (1999). 
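
The simplest way to extract a univalued function d(x, y) from a DSI is to pick, independently at each pixel, the disparity with the lowest cost. The sketch below shows this winner-take-all selection; it ignores the smoothness property mentioned above and is meant only as a baseline (the array layout, with disparity as the first axis, is an assumption).

```python
import numpy as np

def winner_take_all(dsi):
    """dsi has shape (D, H, W); return the lowest-cost disparity index per pixel."""
    return np.argmin(dsi, axis=0)

costs = np.array([[[4.0, 0.5]],
                  [[1.0, 2.0]],
                  [[3.0, 9.0]]])        # D = 3, H = 1, W = 2
disparity = winner_take_all(costs)
```

Per-pixel minima are sensitive to noise and textureless regions, which is why most algorithms aggregate or regularize the costs before making this selection.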



11.2 Sparse correspondence

Early stereo matching algorithms were feature-based, i.e., they first extracted a set of poten- 
tially matchable image locations, using either interest operators or edge detectors, and then 
searched for corresponding locations in other images using a patch-based metric (Hannah 
1974; Marr and Poggio 1979; Mayhew and Frisby 1980; Baker and Binford 1981; Arnold 
1983; Grimson 1985; Ohta and Kanade 1985; Bolles, Baker, and Marimont 1987; Matthies, 
Kanade, and Szeliski 1989; Hsieh, McKeown, and Perlant 1992; Bolles, Baker, and Hannah 
1993). This limitation to sparse correspondences was partially due to computational resource 
limitations, but was also driven by a desire to limit the answers produced by stereo algorithms 
to matches with high certainty. In some applications, there was also a desire to match scenes 
with potentially very different illuminations, where edges might be the only stable features 
(Collins 1996). Such sparse 3D reconstructions could later be interpolated using surface fit- 
ting algorithms such as those discussed in Sections 3.7.1 and 12.3.1. 

More recent work in this area has focused on first extracting highly reliable features and 
then using these as seeds to grow additional matches (Zhang and Shan 2000; Lhuillier and 
Quan 2002). Similar approaches have also been extended to wide baseline multi-view stereo 
problems and combined with 3D surface reconstruction (Lhuillier and Quan 2005; Strecha, 
Tuytelaars, and Van Gool 2003; Goesele, Snavely, Curless et al. 2007) or free-space reasoning 
(Taylor 2003), as described in more detail in Section 11.6. 


11.2.1 3D curves and profiles 

Another example of sparse correspondence is the matching of profile curves (or occluding 
contours), which occur at the boundaries of objects (Figure 11.7) and at interior self occlu- 
sions, where the surface curves away from the camera viewpoint. 

The difficulty in matching profile curves is that in general, the locations of profile curves 
vary as a function of camera viewpoint. Therefore, matching curves directly in two images 
and then triangulating these matches can lead to erroneous shape measurements. Fortunately, 
if three or more closely spaced frames are available, it is possible to fit a local circular arc to 
the locations of corresponding edgels (Figure 11.7a) and therefore obtain semi-dense curved 
surface meshes directly from the matches (Figures 11.7c and g). Another advantage of matching
such curves is that they can be used to reconstruct surface shape for untextured surfaces,
so long as there is a visible difference between foreground and background colors. 

Over the years, a number of different techniques have been developed for reconstructing 
surface shape from profile curves (Giblin and Weiss 1987; Cipolla and Blake 1992; Vaillant 
and Faugeras 1992; Zheng 1994; Boyer and Berger 1997; Szeliski and Weiss 1998). Cipolla 
and Giblin (2000) describe many of these techniques, as well as related topics such as
inferring camera motion from profile curve sequences. Below, we summarize the approach
developed by Szeliski and Weiss (1998), which assumes a discrete set of images, rather than
formulating the problem in a continuous differential framework.

Figure 11.7 Surface reconstruction from occluding contours (Szeliski and Weiss 1998) ©
2002 Springer: (a) circular arc fitting in the epipolar plane; (b) synthetic example of an
ellipsoid with a truncated side and elliptic surface markings; (c) partially reconstructed surface
mesh seen from an oblique and top-down view; (d) real-world image sequence of a soda can
on a turntable; (e) extracted edges; (f) partially reconstructed profile curves; (g) partially
reconstructed surface mesh. (Partial reconstructions are shown so as not to clutter the images.)

Let us assume that the camera is moving smoothly enough that the local epipolar geometry 
varies slowly, i.e., the epipolar planes induced by the successive camera centers and an edgel 
under consideration are nearly co-planar. The first step in the processing pipeline is to extract 
and link edges in each of the input images (Figures 11.7b and e). Next, edgels in successive 
images are matched using pairwise epipolar geometry, proximity and (optionally) appearance. 
This provides a linked set of edges in the spatio-temporal volume, which is sometimes called 
the weaving wall (Baker 1989). 

To reconstruct the 3D location of an individual edgel, along with its local in-plane normal 
and curvature, we project the viewing rays corresponding to its neighbors onto the instanta- 
neous epipolar plane defined by the camera center, the viewing ray, and the camera velocity, 
as shown in Figure 11.7a. We then fit an osculating circle to the projected lines, parameterizing
the circle by its centerpoint $c = (x_c, y_c)$ and radius $r$,

$$c_i x_c + s_i y_c + r = d_i, \tag{11.5}$$

where $c_i = \hat{t}_i \cdot \hat{t}_0$ and $s_i = -\hat{t}_i \cdot \hat{n}_0$ are the cosine and sine of the angle between viewing ray
$i$ and the central viewing ray 0, and $d_i = (q_i - q_0) \cdot \hat{n}_0$ is the perpendicular distance between
viewing ray $i$ and the local origin $q_0$, which is a point chosen on the central viewing ray close
to the line intersections (Szeliski and Weiss 1998). The resulting set of linear equations can
be solved using least squares, and the quality of the solution (residual error) can be used to 
check for erroneous correspondences. 
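
The least-squares step can be sketched as follows. This is a minimal illustration in plain Python, not the implementation of Szeliski and Weiss (1998): the function name `fit_osculating_circle` is hypothetical, and the coefficients c_i, s_i, and d_i are assumed to have been computed already from the camera geometry as described above.

```python
import math

def fit_osculating_circle(cs, ss, ds):
    """Least-squares solve of c_i*x_c + s_i*y_c + r = d_i for (x_c, y_c, r)."""
    # Normal equations (A^T A) p = A^T d, with rows A_i = [c_i, s_i, 1].
    aug = [[0.0] * 4 for _ in range(3)]  # augmented 3x4 system
    for c, s, d in zip(cs, ss, ds):
        row = (c, s, 1.0)
        for j in range(3):
            for k in range(3):
                aug[j][k] += row[j] * row[k]
            aug[j][3] += row[j] * d
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        piv = max(range(col, 3), key=lambda i: abs(aug[i][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for i in range(col + 1, 3):
            f = aug[i][col] / aug[col][col]
            for k in range(col, 4):
                aug[i][k] -= f * aug[col][k]
    # Back substitution.
    p = [0.0, 0.0, 0.0]
    for j in (2, 1, 0):
        p[j] = (aug[j][3] - sum(aug[j][k] * p[k] for k in range(j + 1, 3))) / aug[j][j]
    x_c, y_c, r = p
    # RMS residual: a large value suggests an erroneous correspondence.
    rms = math.sqrt(sum((c * x_c + s * y_c + r - d) ** 2
                        for c, s, d in zip(cs, ss, ds)) / len(ds))
    return x_c, y_c, r, rms
```

At least three viewing rays with distinct angles are needed for the system to be solvable. When the rays are nearly parallel, the cosine column is close to the constant column, so the fit becomes poorly conditioned and a more careful solver would be preferable.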

The resulting set of 3D points, along with their spatial (in-image) and temporal (between-image)
neighbors, form a 3D surface mesh with local normal and curvature estimates (Figures
11.7c and g). Note that whenever a curve is due to a surface marking or a sharp crease
edge, rather than a smooth surface profile curve, this shows up as a 0 or small radius of curva- 
ture. Such curves result in isolated 3D space curves, rather than elements of smooth surface 
meshes, but can still be incorporated into the 3D surface model during a later stage of surface 
interpolation (Section 12.3.1). 


11.3 Dense correspondence 

While sparse matching algorithms are still occasionally used, most stereo matching algo- 
rithms today focus on dense correspondence, since this is required for applications such as 
image-based rendering or modeling. This problem is more challenging than sparse corre- 
spondence, since inferring depth values in textureless regions requires a certain amount of 
guesswork. (Think of a solid colored background seen through a picket fence. What depth 
should it be?) 

In this section, we review the taxonomy and categorization scheme for dense correspon- 
dence algorithms first proposed by Scharstein and Szeliski (2002). The taxonomy consists 
of a set of algorithmic “building blocks” from which a large set of algorithms can be con- 
structed. It is based on the observation that stereo algorithms generally perform some subset 
of the following four steps: 

1. matching cost computation; 

2. cost (support) aggregation; 

3. disparity computation and optimization; and 


4. disparity refinement.

For example, local (window-based) algorithms (Section 11.4), where the disparity computation
at a given point depends only on intensity values within a finite window, usually
make implicit smoothness assumptions by aggregating support. Some of these algorithms
can cleanly be broken down into steps 1, 2, 3. For instance, the traditional sum-of-squared-differences
(SSD) algorithm can be described as:

1. The matching cost is the squared difference of intensity values at a given disparity.

2. Aggregation is done by summing the matching cost over square windows with constant 
disparity. 

3. Disparities are computed by selecting the minimal (winning) aggregated value at each 
pixel. 
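
The three SSD steps above can be sketched directly. This is a minimal, unoptimized illustration on images stored as lists of rows. The function names are hypothetical, and we assume the convention that a left-image pixel at column x with disparity d matches the right-image pixel at column x - d.

```python
INF = float("inf")

def matching_cost(left, right, max_disp):
    """Step 1: squared-difference DSI, indexed as cost[d][y][x]."""
    h, w = len(left), len(left[0])
    return [[[(left[y][x] - right[y][x - d]) ** 2 if x - d >= 0 else INF
              for x in range(w)] for y in range(h)]
            for d in range(max_disp + 1)]

def aggregate(cost, radius):
    """Step 2: sum the cost over a square window at constant disparity."""
    out = []
    for plane in cost:
        h, w = len(plane), len(plane[0])
        out.append([[sum(plane[yy][xx]
                         for yy in range(max(0, y - radius), min(h, y + radius + 1))
                         for xx in range(max(0, x - radius), min(w, x + radius + 1)))
                     for x in range(w)] for y in range(h)])
    return out

def winner_take_all(cost):
    """Step 3: select the disparity with the minimal aggregated cost."""
    h, w = len(cost[0]), len(cost[0][0])
    return [[min(range(len(cost)), key=lambda d: cost[d][y][x])
             for x in range(w)] for y in range(h)]
```

Windows are clipped at the image border, and disparities that would look outside the right image receive an infinite cost, so they are never selected.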

Some local algorithms, however, combine steps 1 and 2 and use a matching cost that is based 
on a support region, e.g., normalized cross-correlation (Hannah 1974; Bolles, Baker, and Hannah
1993), the rank transform (Zabih and Woodfill 1994), and other ordinal measures (Bhat
and Nayar 1998). (This can also be viewed as a preprocessing step; see Section 11.3.1.)

Global algorithms, on the other hand, make explicit smoothness assumptions and then 
solve a global optimization problem (Section 11.5). Such algorithms typically do not perform
an aggregation step, but rather seek a disparity assignment (step 3) that minimizes a
global cost function that consists of data (step 1) terms and smoothness terms. The main
distinction among these algorithms is the minimization procedure used, e.g., simulated annealing
(Marroquin, Mitter, and Poggio 1987; Barnard 1989), probabilistic (mean-field) diffusion
(Scharstein and Szeliski 1998), expectation maximization (EM) (Birchfield, Natarajan, and 
Tomasi 2007), graph cuts (Boykov, Veksler, and Zabih 2001), or loopy belief propagation 
(Sun, Zheng, and Shum 2003), to name just a few. 

In between these two broad classes are certain iterative algorithms that do not explicitly 
specify a global function to be minimized, but whose behavior mimics closely that of iterative 
optimization algorithms (Marr and Poggio 1976; Zitnick and Kanade 2000). Hierarchical 
(coarse-to-fine) algorithms resemble such iterative algorithms, but typically operate on an 
image pyramid where results from coarser levels are used to constrain a more local search at 
finer levels (Witkin, Terzopoulos, and Kass 1987; Quam 1984; Bergen, Anandan, Hanna et 
al. 1992). 


11.3.1 Similarity measures 

The first component of any dense stereo matching algorithm is a similarity measure that 
compares pixel values in order to determine how likely they are to be in correspondence. In 
this section, we briefly review the similarity measures introduced in Section 8.1 and mention a
few others that have been developed specifically for stereo matching (Scharstein and Szeliski
2002; Hirschmüller and Scharstein 2009).

The most common pixel-based matching costs include sums of squared intensity differ- 
ences (SSD) (Hannah 1974) and absolute intensity differences (SAD) (Kanade 1994). In 
the video processing community, these matching criteria are referred to as the mean-squared 
error (MSE) and mean absolute difference (MAD) measures; the term displaced frame dif- 
ference is also often used (Tekalp 1995). 

More recently, robust measures (8.2), including truncated quadratics and contaminated 
Gaussians, have been proposed (Black and Anandan 1996; Black and Rangarajan 1996; 
Scharstein and Szeliski 1998). These measures are useful because they limit the influence 
of mismatches during aggregation. Vaish, Szeliski, Zitnick et al. (2006) compare a number
of such robust measures, including a new one based on the entropy of the pixel values at a 
particular disparity hypothesis (Zitnick, Kang, Uyttendaele et al. 2004), which is particularly 
useful in multi-view stereo. 

Other traditional matching costs include normalized cross-correlation (8.11) (Hannah 
1974; Bolles, Baker, and Hannah 1993; Evangelidis and Psarakis 2008), which behaves 
similarly to sum-of-squared-differences (SSD), and binary matching costs (i.e., match or no 
match) (Marr and Poggio 1976), based on binary features such as edges (Baker and Binford 
1981; Grimson 1985) or the sign of the Laplacian (Nishihara 1984). Because of their poor 
discriminability, simple binary matching costs are no longer used in dense stereo matching. 

Some costs are insensitive to differences in camera gain or bias, for example gradient-based
measures (Seitz 1989; Scharstein 1994), phase and filter-bank responses (Marr and
Poggio 1979; Kass 1988; Jenkin, Jepson, and Tsotsos 1991; Jones and Malik 1992), filters
that remove regular or robust (bilaterally filtered) means (Ansar, Castano, and Matthies 2004;
Hirschmüller and Scharstein 2009), dense feature descriptors (Tola, Lepetit, and Fua 2010),
and non-parametric measures such as rank and census transforms (Zabih and Woodfill 1994),
ordinal measures (Bhat and Nayar 1998), or entropy (Zitnick, Kang, Uyttendaele et al. 2004;
Zitnick and Kang 2007). The census transform, which converts each pixel inside a moving
window into a bit vector representing which neighbors are above or below the central pixel,
was found by Hirschmüller and Scharstein (2009) to be quite robust against large-scale,
non-stationary exposure and illumination changes.
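
To make the census transform concrete, here is a minimal sketch. The function names, the replicated-border handling, and the use of the Hamming distance between bit vectors as the matching cost are assumptions made for illustration.

```python
def census(img, radius=1):
    """Encode each pixel as a bit vector: 1 where a neighbor is darker than the center."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            bits = 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    if dy == 0 and dx == 0:
                        continue
                    # Replicate border pixels when the window leaves the image.
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    bits = (bits << 1) | int(img[yy][xx] < img[y][x])
            out[y][x] = bits
    return out

def hamming(a, b):
    """Number of differing bits between two census codes."""
    return bin(a ^ b).count("1")
```

Because only the ordering of intensities enters the code, any monotonically increasing change of intensities (for example, a positive gain plus a bias) leaves the census image unchanged, which is one way to see why the measure tolerates exposure differences.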

It is also possible to correct for differing global camera characteristics by performing 
a preprocessing or iterative refinement step that estimates inter-image bias-gain variations 
using global regression (Gennert 1988), histogram equalization (Cox, Roy, and Hingorani 
1995), or mutual information (Kim, Kolmogorov, and Zabih 2003; Hirschmüller 2008). Local,
smoothly varying compensation fields have also been proposed (Strecha, Tuytelaars, and
Van Gool 2003; Zhang, McMillan, and Yu 2006). 

In order to compensate for sampling issues, i.e., dramatically different pixel values in
high-frequency areas, Birchfield and Tomasi (1998) proposed a matching cost that is less sensitive
to shifts in image sampling. Rather than just comparing pixel values shifted by integral
amounts (which may miss a valid match), they compare each pixel in the reference image
against a linearly interpolated function of the other image. These and additional matching
costs are studied in more detail by Szeliski and Scharstein (2004) and Hirschmüller
and Scharstein (2009). In particular, if you expect there to be significant exposure or appearance
variation between images that you are matching, some of the more robust measures
that performed well in the evaluation by Hirschmüller and Scharstein (2009), such as the
census transform (Zabih and Woodfill 1994), ordinal measures (Bhat and Nayar 1998), bilateral
subtraction (Ansar, Castano, and Matthies 2004), or hierarchical mutual information
(Hirschmüller 2008), should be used.
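
The sampling-insensitive idea can be illustrated on a single pair of scanlines. The sketch below assumes the commonly used form of the measure, in which the other scanline is linearly interpolated half a sample to either side of the candidate pixel and the cost is the distance from the reference intensity to the resulting interval. The function names and the border clamping are hypothetical choices.

```python
def bt_one_sided(ref, x_ref, other, x_other):
    """Distance from ref[x_ref] to the interpolated interval around other[x_other]."""
    last = len(other) - 1
    centre = other[x_other]
    # Linearly interpolated values half a sample to the left and right.
    left_half = 0.5 * (centre + other[max(x_other - 1, 0)])
    right_half = 0.5 * (centre + other[min(x_other + 1, last)])
    lo = min(centre, left_half, right_half)
    hi = max(centre, left_half, right_half)
    v = ref[x_ref]
    return max(0.0, v - hi, lo - v)

def bt_cost(left_row, x_left, right_row, x_right):
    """Symmetric version: take the smaller of the two one-sided distances."""
    return min(bt_one_sided(left_row, x_left, right_row, x_right),
               bt_one_sided(right_row, x_right, left_row, x_left))
```

For an intensity ramp sampled with a half-pixel offset, e.g., `[0, 10, 20, 30]` against `[5, 15, 25, 35]`, the absolute difference at matching indices is 5, whereas this cost is zero.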

11.4 Local methods 

Local and window-based methods aggregate the matching cost by summing or averaging 
over a support region in the DSI C(x, y, d).⁴ A support region can be either two-dimensional
at a fixed disparity (favoring fronto-parallel surfaces), or three-dimensional in x-y-d space 
(supporting slanted surfaces). Two-dimensional evidence aggregation has been implemented 
using square windows or Gaussian convolution (traditional), multiple windows anchored at 
different points, i.e., shiftable windows (Arnold 1983; Fusiello, Roberto, and Trucco 1997; 
Bobick and Intille 1999), windows with adaptive sizes (Okutomi and Kanade 1992; Kanade 
and Okutomi 1994; Kang, Szeliski, and Chai 2001; Veksler 2001, 2003), windows based on 
connected components of constant disparity (Boykov, Veksler, and Zabih 1998), or the re- 
sults of color-based segmentation (Yoon and Kweon 2006; Tombari, Mattoccia, Di Stefano 
et al. 2008). Three-dimensional support functions that have been proposed include limited 
disparity difference (Grimson 1985), limited disparity gradient (Pollard, Mayhew, and Frisby 
1985), Prazdny’s coherence principle (Prazdny 1985), and the more recent work (which in- 
cludes visibility and occlusion reasoning) by Zitnick and Kanade (2000). 

Aggregation with a fixed support region can be performed using 2D or 3D convolution, 

$$C(x, y, d) = w(x, y, d) * C_0(x, y, d), \tag{11.6}$$

or, in the case of rectangular windows, using efficient moving average box-filters (Sec- 
tion 3.2.2) (Kanade, Yoshida, Oda et al. 1996; Kimura, Shinbo, Yamaguchi et al. 1999). 
Shiftable windows can also be implemented efficiently using a separable sliding min-filter 
(Figure 11.8) (Scharstein and Szeliski 2002, Section 4.2). Selecting among windows of different
shapes and sizes can be performed more efficiently by first computing a summed area
table (Section 3.2.3, 3.30-3.32) (Veksler 2003). Selecting the right window is important,
since windows must be large enough to contain sufficient texture and yet small enough so
that they do not straddle depth discontinuities (Figure 11.9). An alternative method for aggregation
is iterative diffusion, i.e., repeatedly adding to each pixel’s cost the weighted values
of its neighboring pixels’ costs (Szeliski and Hinton 1985; Shah 1993; Scharstein and Szeliski
1998).

⁴ For two recent surveys and comparisons of such techniques, please see the work of Gong, Yang, Wang et al.
(2007) and Tombari, Mattoccia, Di Stefano et al. (2008).

Figure 11.8 Shiftable window (Scharstein and Szeliski 2002) © 2002 Springer. The effect
of trying all 3 x 3 shifted windows around the black pixel is the same as taking the minimum
matching score across all centered (non-shifted) windows in the same neighborhood. (For
clarity, only three of the neighboring shifted windows are shown here.)

Figure 11.9 Aggregation window sizes and weights adapted to image content (Tombari,
Mattoccia, Di Stefano et al. 2008) © 2008 IEEE: (a) original image with selected evaluation
points; (b) variable windows (Veksler 2003); (c) adaptive weights (Yoon and Kweon 2006);
(d) segmentation-based (Tombari, Mattoccia, and Di Stefano 2007). Notice how the adaptive
weights and segmentation-based techniques adapt their support to similarly colored pixels.
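
The box-filter and shiftable-window aggregation described in this section can be sketched for a single disparity plane of the DSI as follows. This is an illustrative sketch only: the function names are hypothetical, costs are assumed to be finite, windows are simply clipped at the image border, and the minimum filter is written for clarity rather than as the efficient sliding min-filter.

```python
def summed_area_table(plane):
    """sat[y][x] holds the sum of plane over rows < y and columns < x."""
    h, w = len(plane), len(plane[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        running = 0
        for x in range(w):
            running += plane[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + running
    return sat

def box_aggregate(plane, radius):
    """Sum costs over centered square windows using four table lookups each."""
    h, w = len(plane), len(plane[0])
    sat = summed_area_table(plane)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        for x in range(w):
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            out[y][x] = sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]
    return out

def min_filter_1d(values, radius):
    n = len(values)
    return [min(values[max(0, i - radius):min(n, i + radius + 1)]) for i in range(n)]

def shiftable_windows(aggregated, radius):
    """Separable minimum filter over the centered-window costs (cf. Figure 11.8)."""
    rows = [min_filter_1d(row, radius) for row in aggregated]
    cols = [min_filter_1d(list(col), radius) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]
```

Because clipped border windows contain fewer pixels, their sums are smaller than those of interior windows. A practical implementation would normalize by the window area or pad the image before comparing costs.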

Of the local aggregation methods compared by Gong, Yang, Wang et al. (2007) and
Tombari, Mattoccia, Di Stefano et al. (2008), the fast variable window approach of Veksler
(2003) and the local weighting approach developed by Yoon and Kweon (2006) consistently
stood out as having the best tradeoff between performance and speed.⁵ The local
weighting technique, in particular, is interesting because, instead of using square windows
with uniform weighting, each pixel within an aggregation window influences the final matching
cost based on its color similarity and spatial distance, just as in bilateral filtering (Figure
11.9c). (In fact, their aggregation step is closely related to doing a joint bilateral filter
on the color/disparity image, except that it is done symmetrically in both reference and target
images.) The segmentation-based aggregation method of Tombari, Mattoccia, and Di Stefano
(2007) did even better, although a fast implementation of this algorithm does not yet exist.

⁵ More recent and extensive results from Tombari, Mattoccia, Di Stefano et al. (2008) can be found at
http://www.vision.deis.unibo.it/spe/SPEHome.aspx.
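
The local weighting idea can be sketched for grayscale images as follows. This is a simplified illustration in the spirit of Yoon and Kweon (2006), not their implementation. The exponential weight, the absolute-difference raw cost, and the parameters `gamma_c` and `gamma_s` are assumptions chosen for the example.

```python
import math

def support_weight(center_val, val, dy, dx, gamma_c, gamma_s):
    """Weight that falls off with intensity difference and spatial distance."""
    return math.exp(-(abs(val - center_val) / gamma_c + math.hypot(dy, dx) / gamma_s))

def adaptive_cost(left, right, y, x, d, radius=2, gamma_c=10.0, gamma_s=5.0):
    """Weighted mean absolute difference for left pixel (x, y) at disparity d.

    Assumes 0 <= x - d, so that the matching right pixel exists.
    """
    h, w = len(left), len(left[0])
    num = den = 0.0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yy, xl = y + dy, x + dx
            xr = xl - d
            if not (0 <= yy < h and 0 <= xl < w and 0 <= xr < w):
                continue
            # Weights are computed symmetrically in the reference and target images.
            wl = support_weight(left[y][x], left[yy][xl], dy, dx, gamma_c, gamma_s)
            wr = support_weight(right[y][x - d], right[yy][xr], dy, dx, gamma_c, gamma_s)
            num += wl * wr * abs(left[yy][xl] - right[yy][xr])
            den += wl * wr
    return num / den
```

When the window straddles an intensity edge, pixels on the far side of the edge receive very small weights, so mismatches there contribute little to the aggregated cost. A uniform window would average them in at full strength.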

In local methods, the emphasis is on the matching cost computation and cost aggregation 
steps. Computing the final disparities is trivial: simply choose at each pixel the disparity 
associated with the minimum cost value. Thus, these methods perform a local “winner- 
take-all” (WTA) optimization at each pixel. A limitation of this approach (and many other 
correspondence algorithms) is that uniqueness of matches is only enforced for one image 
(the reference image), while points in the other image might match multiple points, unless 
cross-checking and subsequent hole filling is used (Fua 1993; Hirschmüller and Scharstein
2009).
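
Cross-checking can be sketched as follows, assuming that both disparity maps are stored as lists of rows, that a left pixel at column x with disparity d corresponds to the right pixel at column x - d, and that rejected pixels are marked with `None` (all conventions chosen for this example).

```python
def cross_check(disp_left, disp_right, tol=0):
    """Keep a left disparity only if the right-image disparity map agrees with it."""
    h, w = len(disp_left), len(disp_left[0])
    out = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            d = disp_left[y][x]
            xr = x - d  # column of the corresponding right-image pixel
            if 0 <= xr < w and abs(disp_right[y][xr] - d) <= tol:
                out[y][x] = d
    return out
```

The pixels left as `None` are the holes that a subsequent hole-filling step would need to fill.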

11.4.1 Sub-pixel estimation and uncertainty 

Most stereo correspondence algorithms compute a set of disparity estimates in some dis- 
cretized space, e.g., for integer disparities (exceptions include continuous optimization tech- 
niques such as optical flow (Bergen, Anandan, Hanna et al. 1992) or splines (Szeliski and 
Coughlan 1997)). For applications such as robot navigation or people tracking, these may be 
perfectly adequate. However, for image-based rendering, such quantized maps lead to very
unappealing view synthesis results, i.e., the scene appears to be made up of many thin shear- 
ing layers. To remedy this situation, many algorithms apply a sub-pixel refinement stage after 
the initial discrete correspondence stage. (An alternative is to simply start with more discrete 
disparity levels (Szeliski and Scharstein 2004).) 

Sub-pixel disparity estimates can be computed in a variety of ways, including iterative 
gradient descent and fitting a curve to the matching costs at discrete disparity levels (Ryan, 
Gray, and Hunt 1980; Lucas and Kanade 1981; Tian and Huhns 1986; Matthies, Kanade, 
and Szeliski 1989; Kanade and Okutomi 1994). This provides an easy way to increase the 
resolution of a stereo algorithm with little additional computation. However, to work well, 
the intensities being matched must vary smoothly, and the regions over which these estimates 
are computed must be on the same (correct) surface. 
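
As an example of the curve-fitting approach, the sketch below fits a parabola through the matching costs at the winning integer disparity and its two neighbors and returns the location of the parabola's minimum. The function name and the fallback to the integer disparity at the ends of the range (or when the costs are not convex) are choices made for this illustration.

```python
def subpixel_disparity(costs, d):
    """Refine integer disparity d using a parabola through costs[d-1], costs[d], costs[d+1]."""
    if d <= 0 or d >= len(costs) - 1:
        return float(d)  # no neighbors on both sides
    c_minus, c0, c_plus = costs[d - 1], costs[d], costs[d + 1]
    curvature = c_minus - 2.0 * c0 + c_plus  # second finite difference
    if curvature <= 0.0:
        return float(d)  # flat or non-convex: keep the integer estimate
    return d + 0.5 * (c_minus - c_plus) / curvature
```

If the costs are samples of an exact parabola, e.g., `(k - 2.3) ** 2` for integer `k`, the refined estimate at `d = 2` recovers 2.3 up to rounding error.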

Recently, some questions have been raised about the advisability of fitting correlation 
curves to integer-sampled matching costs (Shimizu and Okutomi 2001). This situation may 
even be worse when sampling-insensitive dissimilarity measures are used (Birchfield and 
Tomasi 1998). These issues are explored in more depth by Szeliski and Scharstein (2004). 

Besides sub-pixel computations, there are other ways of post-processing the computed
disparities. Occluded areas can be detected using cross-checking, i.e., comparing left-to-right
and right-to-left disparity maps (Fua 1993). A median filter can be applied to clean
up spurious mismatches, and holes due to occlusion can be filled by surface fitting or by
distributing neighboring disparity estimates (Birchfield and Tomasi 1999; Scharstein 1999;
Hirschmüller and Scharstein 2009).

Figure 11.10 Uncertainty in stereo depth estimation (Szeliski 1991b): (a) input image; (b)
estimated depth map (blue is closer); (c) estimated confidence (red is higher). As you can see,
more textured areas have higher confidence.

Another kind of post-processing, which can be useful in later processing stages, is to asso- 
ciate confidences with per-pixel depth estimates (Figure 11.10), which can be done by looking 
at the curvature of the correlation surface, i.e., how strong the minimum in the DSI is
at the winning disparity. Matthies, Kanade, and Szeliski (1989) show that under the assump- 
tion of small noise, photometrically calibrated images, and densely sampled disparities, the 
variance of a local depth estimate can be estimated as 

$$\mathrm{Var}(d) = \frac{\sigma_I^2}{a}, \tag{11.7}$$

where $a$ is the curvature of the DSI as a function of $d$, which can be measured using a local
parabolic fit or by squaring all the horizontal gradients in the window, and $\sigma_I^2$ is the variance
of the image noise, which can be estimated from the minimum SSD score. (See also
Section 6.1.4, (8.44), and Appendix B.6.)
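
A small sketch of this confidence estimate follows. It assumes that the curvature a is taken to be the second finite difference of the aggregated cost at the winning disparity. Other conventions for the parabolic fit differ from this by a constant factor. The function name is hypothetical.

```python
def disparity_variance(c_minus, c0, c_plus, noise_var):
    """Variance of a disparity estimate from the local curvature of the cost curve."""
    a = c_minus - 2.0 * c0 + c_plus  # curvature along d (assumed convention)
    if a <= 0.0:
        return float("inf")  # flat or ambiguous minimum: no confidence
    return noise_var / a
```

Textured regions produce sharply peaked cost curves (large a) and hence small variances, which is consistent with the higher confidence seen in textured areas of Figure 11.10.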

11.4.2 Application: Stereo-based head tracking

A common application of real-time stereo algorithms is for tracking the position of a user 
interacting with a computer or game system. The use of stereo can dramatically improve 
the reliability of such a system compared to trying to use monocular color and intensity 
information (Darrell, Gordon, Harville et al. 2000). Once recovered, this information can
be used in a variety of applications, including controlling a virtual environment or game,
correcting the apparent gaze during video conferencing, and background replacement. We
discuss the first two applications below and defer the discussion of background replacement
to Section 11.5.3.

The use of head tracking to control a user’s virtual viewpoint while viewing a 3D object 
or environment on a computer monitor is sometimes called fish tank virtual reality, since the
user is observing a 3D world as if it were contained inside a fish tank (Ware, Arthur, and 
Booth 1993). Early versions of these systems used mechanical head tracking devices and 
stereo glasses. Today, such systems can be controlled using stereo-based head tracking and 
stereo glasses can be replaced with autostereoscopic displays. Head tracking can also be used 
to construct a “virtual mirror”, where the user’s head can be modified in real-time using a 
variety of visual effects (Darrell, Baker, Crow et al. 1997). 

Another application of stereo head tracking and 3D reconstruction is in gaze correction 
(Ott, Lewis, and Cox 1993). When a user participates in a desktop video-conference or video 
chat, the camera is usually placed on top of the monitor. Since the person is gazing at a 
window somewhere on the screen, it appears as if they are looking down and away from the 
other participants, instead of straight at them. Replacing the single camera with two or more 
cameras enables a virtual view to be constructed right at the position where they are looking,
resulting in virtual eye contact. Real-time stereo matching is used to construct an accurate 3D
head model and view interpolation (Section 13.1) is used to synthesize the novel in-between 
view (Criminisi, Shotton, Blake et al. 2003). 

11.5 Global optimization 

Global stereo matching methods perform some optimization or iteration steps after the dis- 
parity computation phase and often skip the aggregation step altogether, because the global 
smoothness constraints perform a similar function. Many global methods are formulated in 
an energy-minimization framework, where, as we saw in Sections 3.7 (3.100-3.102) and 8.4, 
the objective is to find a solution d that minimizes a global energy, 

$$E(d) = E_d(d) + \lambda E_s(d). \tag{11.8}$$

The data term, $E_d(d)$, measures how well the disparity function $d$ agrees with the input image
pair. Using our previously defined disparity space image, we define this energy as

$$E_d(d) = \sum_{(x,y)} C(x, y, d(x, y)), \tag{11.9}$$

where $C$ is the (initial or aggregated) matching cost DSI.

The smoothness term $E_s(d)$ encodes the smoothness assumptions made by the algorithm.
To make the optimization computationally tractable, the smoothness term is often restricted
to measuring only the differences between neighboring pixels’ disparities,

$$E_s(d) = \sum_{(x,y)} \rho(d(x, y) - d(x+1, y)) + \rho(d(x, y) - d(x, y+1)), \tag{11.10}$$

where $\rho$ is some monotonically increasing function of disparity difference. It is also possible
to use larger neighborhoods, such as $\mathcal{N}_8$, which can lead to better boundaries (Boykov
and Kolmogorov 2003), or to use second-order smoothness terms (Woodford, Reid, Torr et 
al. 2008), but such terms require more complex optimization techniques. An alternative to 
smoothness functionals is to use a lower-dimensional representation such as splines (Szeliski 
and Coughlan 1997). 

In standard regularization (Section 3.7.1), $\rho$ is a quadratic function, which makes d smooth
everywhere and may lead to poor results at object boundaries. Energy functions that do not
have this problem are called discontinuity-preserving and are based on robust $\rho$ functions
(Terzopoulos 1986b; Black and Rangarajan 1996). The seminal paper by Geman and Ge- 
man (1984) gave a Bayesian interpretation of these kinds of energy functions and proposed a 
discontinuity-preserving energy function based on Markov random fields (MRFs) and addi- 
tional line processes, which are additional binary variables that control whether smoothness 
penalties are enforced or not. Black and Rangarajan (1996) show how independent line pro- 
cess variables can be replaced by robust pairwise disparity terms. 
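
To make the energy concrete, the sketch below evaluates the sum of the data term and the weighted smoothness term for a given disparity map. The truncated linear penalty is just one example of a robust function, chosen here for illustration. The function names, the DSI layout `cost[d][y][x]`, and the parameter `tau` are assumptions.

```python
def truncated_linear(delta, tau=2):
    """An example robust penalty: linear for small differences, constant beyond tau."""
    return min(abs(delta), tau)

def stereo_energy(cost, disp, lam, rho=truncated_linear):
    """Data term plus lam times the smoothness term over 4-connected neighbors."""
    h, w = len(disp), len(disp[0])
    data = sum(cost[disp[y][x]][y][x] for y in range(h) for x in range(w))
    smooth = 0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                smooth += rho(disp[y][x] - disp[y][x + 1])
            if y + 1 < h:
                smooth += rho(disp[y][x] - disp[y + 1][x])
    return data + lam * smooth
```

Evaluating the energy is cheap. The difficulty addressed by the optimization algorithms discussed in this section lies in finding the disparity map that minimizes it.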

The terms in $E_s$ can also be made to depend on the intensity differences, e.g.,

$$\rho_d(d(x, y) - d(x+1, y)) \cdot \rho_I(\|I(x, y) - I(x+1, y)\|), \tag{11.11}$$

where $\rho_I$ is some monotonically decreasing function of intensity differences that lowers
smoothness costs at high-intensity gradients. This idea (Gamble and Poggio 1987; Fua 1993; 
Bobick and Intille 1999; Boykov, Veksler, and Zabih 2001) encourages disparity discontinu- 
ities to coincide with intensity or color edges and appears to account for some of the good 
performance of global optimization approaches. While most researchers set these functions 
heuristically, Scharstein and Pal (2007) show how the free parameters in such conditional 
random fields (Section 3.7.2, (3.118)) can be learned from ground truth disparity maps. 

Once the global energy has been defined, a variety of algorithms can be used to find a 
(local) minimum. Traditional approaches associated with regularization and Markov random 
fields include continuation (Blake and Zisserman 1987), simulated annealing (Geman and 
Geman 1984; Marroquin, Mitter, and Poggio 1987; Barnard 1989), highest confidence first 
(Chou and Brown 1990), and mean-field annealing (Geiger and Girosi 1991). 

More recently, max-flow and graph cut methods have been proposed to solve a special 
class of global optimization problems (Roy and Cox 1998; Boykov, Veksler, and Zabih 2001; 
Ishikawa 2003). Such methods are more efficient than simulated annealing and have produced 
good results, as have techniques based on loopy belief propagation (Sun, Zheng, and Shum
2003; Tappen and Freeman 2003). Appendix B.5 and a recent survey paper on MRF inference
(Szeliski, Zabih, Scharstein et al. 2008) discuss and compare such techniques in more detail. 

While global optimization techniques currently produce the best stereo matching results, 
there are some alternative approaches worth studying. 


Cooperative algorithms. Cooperative algorithms, inspired by computational models of hu- 
man stereo vision, were among the earliest methods proposed for disparity computation (Dev 
1974; Marr and Poggio 1976; Marroquin 1983; Szeliski and Hinton 1985; Zitnick and Kanade 
2000). Such algorithms iteratively update disparity estimates using non-linear operations that 
result in an overall behavior similar to global optimization algorithms. In fact, for some of 
these algorithms, it is possible to explicitly state a global function that is being minimized 
(Scharstein and Szeliski 1998). 


Coarse-to-fine and incremental warping. Most of today’s best algorithms first enumer- 
ate all possible matches at all possible disparities and then select the best set of matches in 
some way. Faster approaches can sometimes be obtained using methods inspired by classic 
(infinitesimal) optical flow computation. Here, images are successively warped and disparity 
estimates incrementally updated until a satisfactory registration is achieved. These techniques 
are most often implemented within a coarse-to-fine hierarchical refinement framework (Quam 
1984; Bergen, Anandan, Hanna et al. 1992; Barron, Fleet, and Beauchemin 1994; Szeliski
and Coughlan 1997). 


11.5.1 Dynamic programming 

A different class of global optimization algorithm is based on dynamic programming. While 
the 2D optimization of Equation (11.8) can be shown to be NP-hard for common classes 
of smoothness functions (Veksler 1999), dynamic programming can find the global mini- 
mum for independent scanlines in polynomial time. Dynamic programming was first used 
for stereo vision in sparse, edge-based methods (Baker and Binford 1981; Ohta and Kanade 
1985). More recent approaches have focused on the dense (intensity-based) scanline matching
problem (Belhumeur 1996; Geiger, Ladendorf, and Yuille 1992; Cox, Hingorani, Rao et
al. 1996; Bobick and Intille 1999; Birchfield and Tomasi 1999). These approaches work by
computing the minimum-cost path through the matrix of all pairwise matching costs between 
two corresponding scanlines, i.e., through a horizontal slice of the DSI. Partial occlusion is 
handled explicitly by assigning a group of pixels in one image to a single pixel in the other 
image. Figure 11.11 schematically shows how DP works, while Figure 11.5f shows a real
DSI slice over which the DP is applied.

Figure 11.11 Stereo matching using dynamic programming, as illustrated by (a) Scharstein
and Szeliski (2002) © 2002 Springer and (b) Kolmogorov, Criminisi, Blake et al. (2006) ©
2006 IEEE. For each pair of corresponding scanlines, a minimizing path through the matrix
of all pairwise matching costs (DSI) is selected. Lowercase letters (a-k) symbolize the inten- 
sities along each scanline. Uppercase letters represent the selected path through the matrix. 
Matches are indicated by M, while partially occluded points (which have a fixed cost) are 
indicated by L or R, corresponding to points only visible in the left or right images, respec- 
tively. Usually, only a limited disparity range is considered (0-4 in the figure, indicated by 
the non-shaded squares). The representation in (a) allows for diagonal moves while the rep- 
resentation in (b) does not. Note that these diagrams, which use the Cyclopean representation 
of depth, i.e., depth relative to a camera between the two input cameras, show an “unskewed” 
x-d slice through the DSI. 


To implement dynamic programming for a scanline y, each entry (state) in a 2D cost 
matrix D(m, n) is computed by combining its DSI value 

$$C'(m, n) = C(m + n, m - n, y) \tag{11.12}$$

with one of its predecessor cost values. Using the representation shown in Figure 11.11a, 
which allows for “diagonal” moves, the aggregated match costs can be recursively computed 
as 

$$\begin{aligned}
D(m, n, M) &= \min(D(m-1, n-1, M),\, D(m-1, n, L),\, D(m-1, n-1, R)) + C'(m, n) \\
D(m, n, L) &= \min(D(m-1, n-1, M),\, D(m-1, n, L)) + O \\
D(m, n, R) &= \min(D(m, n-1, M),\, D(m, n-1, R)) + O,
\end{aligned} \tag{11.13}$$

where O is a per-pixel occlusion cost. The aggregation rules corresponding to Figure 11.11b 
are given by Kolmogorov, Criminisi, Blake et al. (2006), who also use a two-state foreground- 
background model for bi-layer segmentation. 
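
As a minimal sketch of how the recurrences in (11.13) can be evaluated, the following Python function fills the three cost tables for one scanline. The function name dp_scanline, the nested-list layout cost[m][n] standing in for C'(m, n), and the boundary handling (the path is assumed to start with a match at (0, 0), and predecessors outside the matrix are treated as unreachable) are assumptions made for illustration and are not taken from the cited papers.

```python
import math

INF = math.inf


def dp_scanline(cost, occ):
    """Fill the three-state cost tables of (11.13) for one scanline.

    cost[m][n] plays the role of C'(m, n); occ is the occlusion cost O.
    Returns a dict of 2D lists keyed by the states 'M', 'L', and 'R'.
    """
    rows, cols = len(cost), len(cost[0])
    D = {s: [[INF] * cols for _ in range(rows)] for s in "MLR"}

    def get(state, m, n):
        # Assumption: predecessors outside the matrix are unreachable.
        if m < 0 or n < 0:
            return INF
        return D[state][m][n]

    for m in range(rows):
        for n in range(cols):
            if m == 0 and n == 0:
                # Assumption: every path starts with a match at the origin.
                D["M"][0][0] = cost[0][0]
                continue
            D["M"][m][n] = min(get("M", m - 1, n - 1),
                               get("L", m - 1, n),
                               get("R", m - 1, n - 1)) + cost[m][n]
            D["L"][m][n] = min(get("M", m - 1, n - 1),
                               get("L", m - 1, n)) + occ
            D["R"][m][n] = min(get("M", m, n - 1),
                               get("R", m, n - 1)) + occ
    return D
```

Recovering the actual path additionally requires recording, for every state, which predecessor achieved the minimum and then backtracking from the end of the scanline.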
Problems with dynamic programming stereo include the selection of the right cost for 
occluded pixels and the difficulty of enforcing inter-scanline consistency, although several 
methods propose ways of addressing the latter (Ohta and Kanade 1985; Belhumeur 1996; 
Cox, Hingorani, Rao et al. 1996; Bobick and Intille 1999; Birchfield and Tomasi 1999; 
Kolmogorov, Criminisi, Blake et al. 2006). Another problem is that the dynamic program- 
ming approach requires enforcing the monotonicity or ordering constraint (Yuille and Poggio 
1984). This constraint requires that the relative ordering of pixels on a scanline remain the 
same between the two views, which may not be the case in scenes containing narrow fore- 
ground objects. 

An alternative to traditional dynamic programming, introduced by Scharstein and Szeliski 
(2002), is to neglect the vertical smoothness constraints in (11.10) and simply optimize in- 
dependent scanlines in the global energy function (11.8), which can easily be done using a 
recursive algorithm, 

$$D(x, y, d) = C(x, y, d) + \min_{d'} \{ D(x - 1, y, d') + \rho_d(d - d') \}. \tag{11.14}$$

The advantage of this scanline optimization algorithm is that it computes the same represen- 
tation and minimizes a reduced version of the same energy function as the full 2D energy 
function (11.8). Unfortunately, it still suffers from the same streaking artifacts as dynamic 
programming. 
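
The recursion in (11.14) can be sketched for a single scanline as follows. This is only an illustrative implementation: the name scanline_optimize, the layout cost[x][d] for C(x, y, d) at a fixed y, and the choice to pass the smoothness function as a Python callable rho are assumptions. Back-pointers are stored so that the minimizing disparities can be read off by backtracking.

```python
def scanline_optimize(cost, rho):
    """Evaluate (11.14) along one scanline and backtrack the best disparities.

    cost[x][d] is the matching cost at column x and disparity d;
    rho(delta) is the smoothness penalty for a disparity change of delta.
    Returns (D, path), where D[x][d] is the cumulative cost.
    """
    width, num_disp = len(cost), len(cost[0])
    D = [list(cost[0])]      # the first column has no predecessor
    back = []                # back[x - 1][d] = best d' at column x - 1
    for x in range(1, width):
        row, pointers = [], []
        for d in range(num_disp):
            best = min(range(num_disp),
                       key=lambda dp: D[x - 1][dp] + rho(d - dp))
            row.append(cost[x][d] + D[x - 1][best] + rho(d - best))
            pointers.append(best)
        D.append(row)
        back.append(pointers)

    # Pick the cheapest final state and follow the back-pointers.
    d = min(range(num_disp), key=lambda k: D[-1][k])
    path = [d]
    for pointers in reversed(back):
        d = pointers[d]
        path.append(d)
    path.reverse()
    return D, path
```

With a penalty that is large relative to the data costs, the recovered path stays at one disparity; with a zero penalty, it reduces to an independent per-pixel minimum.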

A much better approach is to evaluate the cumulative cost function (11.14) from multiple 
directions, e.g., from the eight compass directions, N, E, W, S, NE, SE, SW, NW (Hirschmüller 
2008). The resulting semi-global optimization performs quite well and is extremely efficient 
to implement. 
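
The idea of accumulating (11.14) along several directions and summing the results can be sketched as below. This is a simplified illustration, not the published algorithm, which uses its own penalty terms and normalization that are not reproduced here. The names aggregate_direction and semi_global, the layout cost[y][x][d], and the use of unit steps (dx, dy) are assumptions.

```python
def aggregate_direction(cost, dx, dy, rho):
    """Evaluate the recursion of (11.14) along the direction (dx, dy).

    cost[y][x][d] is the matching cost; the result has the same shape.
    dx and dy are assumed to be -1, 0, or 1 (not both zero).
    """
    height, width, num_disp = len(cost), len(cost[0]), len(cost[0][0])
    agg = [[None] * width for _ in range(height)]
    # Visit pixels so that the predecessor (x - dx, y - dy) is done first.
    ys = range(height) if dy >= 0 else range(height - 1, -1, -1)
    xs = range(width) if dx >= 0 else range(width - 1, -1, -1)
    for y in ys:
        for x in xs:
            px, py = x - dx, y - dy
            if 0 <= px < width and 0 <= py < height:
                prev = agg[py][px]
                agg[y][x] = [
                    cost[y][x][d]
                    + min(prev[dp] + rho(d - dp) for dp in range(num_disp))
                    for d in range(num_disp)
                ]
            else:
                agg[y][x] = list(cost[y][x])  # path starts at the border
    return agg


def semi_global(cost, rho, directions):
    """Sum the directional costs and pick the best disparity per pixel."""
    height, width, num_disp = len(cost), len(cost[0]), len(cost[0][0])
    total = [[[0] * num_disp for _ in range(width)] for _ in range(height)]
    for dx, dy in directions:
        agg = aggregate_direction(cost, dx, dy, rho)
        for y in range(height):
            for x in range(width):
                for d in range(num_disp):
                    total[y][x][d] += agg[y][x][d]
    disparity = [
        [min(range(num_disp), key=total[y][x].__getitem__)
         for x in range(width)]
        for y in range(height)
    ]
    return total, disparity
```

Passing all eight unit steps, e.g., [(1, 0), (-1, 0), (0, 1), (0, -1), (1, 1), (1, -1), (-1, 1), (-1, -1)], corresponds to the eight-direction case mentioned above.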

Even though dynamic programming and scanline optimization algorithms do not gen- 
erally produce the most accurate stereo reconstructions, when combined with sophisticated 
aggregation strategies, they can produce very fast and high-quality results. 

11.5.2 Segmentation-based techniques 

While most stereo matching algorithms perform their computations on a per-pixel basis, some 
of the more recent techniques first segment the images into regions and then try to label each 
region with a disparity. 

For example, Tao, Sawhney, and Kumar (2001) segment the reference image, estimate 
per-pixel disparities using a local technique, and then do local plane fits inside each segment 
before applying smoothness constraints between neighboring segments. Zitnick, Kang, Uyt- 
tendaele et al. (2004) and Zitnick and Kang (2007) use over-segmentation to mitigate initial 
bad segmentations. After a set of initial cost values for each segment has been stored into 
a disparity space distribution (DSD), iterative relaxation (or loopy belief propagation, in the 


Figure 11.12 Segmentation-based stereo matching (Zitnick, Kang, Uyttendaele et al. 2004) 
© 2004 ACM: (a) input color image; (b) color-based segmentation; (c) initial disparity es- 
timates; (d) final piecewise-smoothed disparities; (e) MRF neighborhood defined over the 
segments in the disparity space distribution (Zitnick and Kang 2007) © 2007 Springer. 
Figure 11.13 Stereo matching with adaptive over-segmentation and matting (Taguchi, 
Wilburn, and Zitnick 2008) © 2008 IEEE: (a) segment boundaries are refined during the 
optimization, leading to more accurate results (e.g., the thin green leaf in the bottom row); (b) 
alpha mattes are extracted at segment boundaries, which leads to visually better compositing 
results (middle column). 


more recent work of Zitnick and Kang (2007)) is used to adjust the disparity estimates for 
each segment, as shown in Figure 11.12. Taguchi, Wilburn, and Zitnick (2008) refine the 
segment shapes as part of the optimization process, which leads to much improved results, as 
shown in Figure 11.13. 
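
The local plane fits performed inside each segment can be illustrated with an ordinary least-squares fit of d = a x + b y + c to the disparity samples of one segment. The sketch below is an assumption-laden illustration rather than the procedure of any cited paper: the function names are hypothetical, the fit is not robust to outliers, and the 3 x 3 normal equations are solved with Cramer's rule to keep the example self-contained.

```python
def det3(a):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))


def fit_disparity_plane(samples):
    """Least-squares fit of d = a*x + b*y + c to (x, y, d) samples.

    Returns (a, b, c). Raises ValueError if the samples are degenerate
    (e.g., fewer than three or all collinear).
    """
    sxx = sxy = sx = syy = sy = n = 0.0
    sxd = syd = sd = 0.0
    for x, y, d in samples:
        sxx += x * x
        sxy += x * y
        sx += x
        syy += y * y
        sy += y
        n += 1
        sxd += x * d
        syd += y * d
        sd += d
    normal = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    rhs = [sxd, syd, sd]
    det = det3(normal)
    if abs(det) < 1e-12:
        raise ValueError("degenerate segment: cannot fit a plane")
    params = []
    for col in range(3):
        replaced = [row[:] for row in normal]
        for r in range(3):
            replaced[r][col] = rhs[r]   # Cramer's rule: swap in the RHS
        params.append(det3(replaced) / det)
    return tuple(params)
```

In practice, the per-pixel disparities inside a segment contain outliers, so a robust variant of this fit would usually be preferred.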

Even more accurate results are obtained by Klaus, Sormann, and Karner (2006), who first 
segment the reference image using mean shift, run a small (3 × 3) SAD plus gradient SAD 
(weighted by cross-checking) to get initial disparity estimates, fit local planes, re-fit with 
global planes, and then run a final MRF on plane assignments with loopy belief propagation. 
When the algorithm was first introduced in 2006, it was the top ranked algorithm on the 
evaluation site at http://vision.middlebury.edu/stereo; in early 2010, it still had the top rank 
on the new evaluation datasets. 

The highest ranked algorithm, by Wang and Zheng (2008), follows a similar approach of 
segmenting the image, doing local plane fits, and then performing cooperative optimization 






Computer Vision: Algorithms and Applications (September 3, 2010 draft) 


of neighboring plane fit parameters. Another highly ranked algorithm, by Yang, Wang, Yang 
et al. (2009), uses the color correlation approach of Yoon and Kweon (2006) and hierarchical 
belief propagation to obtain an initial set of disparity estimates. After left-right consistency 
checking to detect occluded pixels, the data terms for low-confidence and occluded pixels 
are recomputed using segmentation-based plane fits and one or more rounds of hierarchical 
belief propagation are used to obtain the final disparity estimates. 
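
The left-right consistency check mentioned above can be sketched for one scanline as follows. The convention that a left-image pixel at column x with disparity d corresponds to the right-image pixel at column x - d, the tolerance parameter tol, and the function name are assumptions made for this illustration.

```python
def left_right_check(disp_left, disp_right, tol=1):
    """Flag left-image pixels whose disparity agrees with the right image.

    disp_left[x] and disp_right[x] are integer disparities on one scanline.
    Assumes left pixel x matches right pixel x - disp_left[x].
    Returns a list of booleans; False marks likely occluded or bad pixels.
    """
    width = len(disp_left)
    consistent = []
    for x, d in enumerate(disp_left):
        xr = x - d
        ok = 0 <= xr < width and abs(disp_right[xr] - d) <= tol
        consistent.append(ok)
    return consistent
```

Pixels flagged False are the ones whose data terms would be recomputed, for example from a segment's plane fit.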

Another important ability of segmentation-based stereo algorithms, which they share with 
algorithms that use explicit layers (Baker, Szeliski, and Anandan 1998; Szeliski and Golland 
1999) or boundary extraction (Hasinoff, Kang, and Szeliski 2006), is the ability to extract 
fractional pixel alpha mattes at depth discontinuities (Bleyer, Gelautz, Rother et al. 2009). 
This ability is crucial when attempting to create virtual view interpolation without clinging 
boundary or tearing artifacts (Zitnick, Kang, Uyttendaele et al. 2004) and also to seamlessly 
insert virtual objects (Taguchi, Wilburn, and Zitnick 2008), as shown in Figure 11.13b. 

Since new stereo matching algorithms continue to be introduced every year, it is a good 
idea to periodically check the Middlebury evaluation site at http://vision.middlebury.edu/ 
stereo for a listing of the most recent algorithms to be evaluated. 


11.5.3 Application: Z-keying and background replacement 

Another application of real-time stereo matching is z-keying, which is the process of segmenting 
a foreground actor from the background using depth information, usually for the 
purpose of replacing the background with some computer-generated imagery, as shown in 
Figure 11.2g. 
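
In its simplest form, z-keying amounts to thresholding the depth (or disparity) map and compositing. The sketch below assumes that larger disparities correspond to closer surfaces, uses a hard threshold named min_disparity, and stores images as nested lists; all of these are illustrative choices, and practical systems refine the resulting binary mask considerably.

```python
def z_key(image, disparity, background, min_disparity):
    """Replace every pixel farther than the threshold with the background.

    image, disparity, and background are same-sized nested lists.
    A pixel is kept as foreground when its disparity >= min_disparity.
    """
    return [
        [fg if d >= min_disparity else bg
         for fg, d, bg in zip(img_row, disp_row, bg_row)]
        for img_row, disp_row, bg_row in zip(image, disparity, background)
    ]
```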

Originally, z-keying systems required expensive custom-built hardware to produce the 
desired depth maps in real time and were, therefore, restricted to broadcast studio applica- 
tions (Kanade, Yoshida, Oda et al. 1996; Iddan and Yahav 2001). Off-line systems were also 
developed for estimating 3D multi-viewpoint geometry from video streams (Section 13.5.4) 
(Kanade, Rander, and Narayanan 1997; Carranza, Theobalt, Magnor et al. 2003; Zitnick, 
Kang, Uyttendaele et al. 2004; Vedula, Baker, and Kanade 2005). Recent advances in highly 
accurate real-time stereo matching, however, now make it possible to perform z-keying on 
regular PCs, enabling desktop videoconferencing applications such as those shown in Fig- 
ure 11.14 (Kolmogorov, Criminisi, Blake et al. 2006). 


11.6 Multi-view stereo 

While matching pairs of images is a useful way of obtaining depth information, matching 
more images can lead to even better results. In this section, we review not only techniques for 
Figure 11.14 Background replacement using z-keying with a bi-layer segmentation algorithm 
(Kolmogorov, Criminisi, Blake et al. 2006) © 2006 IEEE. 

creating complete 3D object models, but also simpler techniques for improving the quality of 
depth maps using multiple source images. 

As we saw in our discussion of plane sweep (Section 11.1.2), it is possible to resample 
all neighboring k images at each disparity hypothesis d into a generalized disparity space 
volume I(x, y, d, k). The simplest way to take advantage of these additional images is to sum 
up their differences from the reference image I_r as in (11.4), 

$$C(x, y, d) = \sum_k \rho(I(x, y, d, k) - I_r(x, y)). \tag{11.15}$$

This is the basis of the well-known sum of summed-squared-difference (SSSD) and SSAD 
approaches (Okutomi and Kanade 1993; Kang, Webb, Zitnick et al. 1995), which can be ex- 
tended to reason about likely patterns of occlusion (Nakamura, Matsuura, Satoh et al. 1996). 
More recent work by Gallup, Frahm, Mordohai et al. (2008) shows how to adapt the baselines 
used to the expected depth in order to get the best tradeoff between geometric accuracy 
(wide baseline) and robustness to occlusion (narrow baseline). Alternative multi-view cost 
metrics include measures such as synthetic focus sharpness and the entropy of the pixel color 
distribution (Vaish, Szeliski, Zitnick et al. 2006). 
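
A direct transcription of (11.15) is sketched below, assuming that the neighboring images have already been resampled into the generalized disparity space volume. The array layout resampled[k][d][y][x] for I(x, y, d, k), the name multiview_cost, and the default choice of a squared-difference rho are assumptions for illustration.

```python
def multiview_cost(resampled, reference, rho=lambda e: e * e):
    """Sum per-image differences from the reference image, as in (11.15).

    resampled[k][d][y][x] holds I(x, y, d, k); reference[y][x] holds I_r.
    Returns cost[d][y][x]. The default rho gives a sum of squared differences.
    """
    num_disp = len(resampled[0])
    height, width = len(reference), len(reference[0])
    cost = [[[0.0] * width for _ in range(height)] for _ in range(num_disp)]
    for image in resampled:              # loop over neighboring images k
        for d in range(num_disp):
            for y in range(height):
                for x in range(width):
                    diff = image[d][y][x] - reference[y][x]
                    cost[d][y][x] += rho(diff)
    return cost
```

Passing rho=abs gives the absolute-difference variant instead of the squared one.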

A useful way to visualize the multi-frame stereo estimation problem is to examine the 
epipolar plane image (EPI) formed by stacking corresponding scanlines from all the images, 
as shown in Figures 8.13c and 11.15 (Bolles, Baker, and Marimont 1987; Baker and Bolles 
1989; Baker 1989). As you can see in Figure 11.15, as a camera translates horizontally (in a 
standard horizontally rectified geometry), objects at different depths move sideways at a rate 
inversely proportional to their depth (11.1). 6 Foreground objects occlude background objects, 
which can be seen as EPI-strips (Criminisi, Kang, Swaminathan et al. 2005) occluding other 

6 The four-dimensional generalization of the EPI is the light field, which we study in Section 13.3. In principle, 
Figure 11.15 Epipolar plane image (EPI) (Gortler, Grzeszczuk, Szeliski et al. 1996) © 1996 
ACM and a schematic EPI (Kang, Szeliski, and Chai 2001) © 2001 IEEE: (a) The Lumigraph 
(light field) (Section 13.3) is the 4D space of all light rays passing through a volume of space. 
Taking a 2D slice results in all of the light rays embedded in a plane and is equivalent to a 
scanline taken from a stacked EPI volume. Objects at different depths move sideways with 
velocities (slopes) proportional to their inverse depth. Occlusion (and translucency) effects 
can easily be seen in this representation. (b) The EPI corresponding to Figure 11.16 showing 
the three images (middle, left, and right) as slices through the EPI volume. The spatially and 
temporally shifted window around the black pixel is indicated by the rectangle, showing the 
right image is not being used in matching. 


strips in the EPI. If we are given a dense enough set of images, we can find such strips and 
reason about their relationships in order to both reconstruct the 3D scene and make inferences 
about translucent objects (Tsin, Kang, and Szeliski 2006) and specular reflections (Swaminathan, 
Kang, Szeliski et al. 2002; Criminisi, Kang, Swaminathan et al. 2005). Alternatively, 
we can treat the series of images as a set of sequential observations and merge them using 
Kalman filtering (Matthies, Kanade, and Szeliski 1989) or maximum likelihood inference 
(Cox 1994). 

When fewer images are available, it becomes necessary to fall back on aggregation tech- 
niques such as sliding windows or global optimization. With additional input images, how- 
ever, the likelihood of occlusions increases. It is therefore prudent to adjust not only the best 
window locations using a shiftable window approach, as shown in Figure 11.16a, but also to 
optionally select a subset of neighboring frames in order to discount those images where the 
region of interest is occluded, as shown in Figure 11.16b (Kang, Szeliski, and Chai 2001). 


there is enough information in a light field to recover both the shape and the BRDF of objects (Soatto, Yezzi, and Jin 
2003), although relatively little progress has been made to date on this topic. 
Figure 11.16 Spatio-temporally shiftable windows (Kang, Szeliski, and Chai 2001) © 2001 
IEEE: A simple three-image sequence (the middle image is the reference image), which has 
a moving frontal gray square (marked F) and a stationary background. Regions B, C, D, and 
E are partially occluded. (a) A regular SSD algorithm will make mistakes when matching 
pixels in these regions (e.g., the window centered on the black pixel in region B) and in 
windows straddling depth discontinuities (the window centered on the white pixel in region 
F). (b) Shiftable windows help mitigate the problems in partially occluded regions and near 
depth discontinuities. The shifted window centered on the white pixel in region F matches 
correctly in all frames. The shifted window centered on the black pixel in region B matches 
correctly in the left image, but requires temporal selection to disable matching the right image. 
Figure 11.15b shows an EPI corresponding to this sequence and describes in more detail how 
temporal selection works. 


Figure 11.15b shows how such spatio-temporal selection or shifting of windows corresponds 
to selecting the most likely un-occluded volumetric region in the epipolar plane image vol- 
ume. 
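
Temporal selection can be sketched as a small change to how per-frame matching costs are combined for a given pixel and disparity: instead of summing over all neighboring frames, only the best-matching subset is used. The two helper functions below are hypothetical illustrations of the "best n of N frames" and "better half sequence" strategies; their names and the use of plain sums are assumptions.

```python
def best_subset_cost(per_frame_costs, keep):
    """Sum only the `keep` smallest per-frame costs.

    per_frame_costs holds one matching cost per neighboring frame for a
    single pixel and disparity hypothesis.
    """
    if not 1 <= keep <= len(per_frame_costs):
        raise ValueError("keep must be between 1 and the number of frames")
    return sum(sorted(per_frame_costs)[:keep])


def better_half_cost(costs_before, costs_after):
    """Use whichever half of the sequence (before or after the reference
    frame) matches better, on the assumption that an occlusion usually
    affects only one side."""
    return min(sum(costs_before), sum(costs_after))
```

For example, with per-frame costs [1, 9, 2, 8], keeping the best two frames yields a cost of 3, whereas summing over all frames yields 20, which is dominated by the two occluded views.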

The results of applying these techniques to the multi-frame flower garden image sequence 
are shown in Figure 11.17, which compares the results of using regular (non-shifted) SSSD 
with spatially shifted windows and full spatio-temporal window selection. (The task of 
applying stereo to a rigid scene filmed with a moving camera is sometimes called motion 
stereo.) Similar improvements from using spatio-temporal selection are reported by Kang 
and Szeliski (2004) and are evident even when local measurements are combined with global 
optimization. 

While computing a depth map from multiple inputs outperforms pairwise stereo match- 
ing, even more dramatic improvements can be obtained by estimating multiple depth maps 
simultaneously (Szeliski 1999; Kang and Szeliski 2004). The existence of multiple depth 
maps enables more accurate reasoning about occlusions, as regions which are occluded in 
one image may be visible (and matchable) in others. The multi-view reconstruction problem 
can be formulated as the simultaneous estimation of depth maps at key frames (Figure 8.13c) 
while maximizing not only photoconsistency and piecewise disparity smoothness but also the 
consistency between disparity estimates at different frames. While Szeliski (1999) and Kang 


Figure 11.17 Local (5 × 5 window-based) matching results (Kang, Szeliski, and Chai 2001) 
© 2001 IEEE: (a) window that is not spatially perturbed (centered); (b) spatially perturbed 
window; (c) using the best five of 10 neighboring frames; (d) using the better half sequence. 
Notice how the results near the tree trunk are improved using temporal selection. 

and Szeliski (2004) use soft (penalty-based) constraints to encourage multiple disparity maps 
to be consistent, Kolmogorov and Zabih (2002) show how such consistency measures can 
be encoded as hard constraints, which guarantee that the multiple depth maps are not only 
similar but actually identical in overlapping regions. Newer algorithms that simultaneously 
estimate multiple disparity maps include papers by Maitre, Shinagawa, and Do (2008) and 
Zhang, Jia, Wong et al. (2008). 

A closely related topic to multi-frame stereo estimation is scene flow, in which multiple 
cameras are used to capture a dynamic scene. The task is then to simultaneously recover the 
3D shape of the object at every instant in time and to estimate the full 3D motion of every 
surface point between frames. Representative papers in this area include those by Vedula, 
Baker, Rander et al. (2005), Zhang and Kambhamettu (2003), Pons, Keriven, and Faugeras 
(2007), Huguet and Devernay (2007), and Wedel, Rabe, Vaudrey et al. (2008). Figure 11.18a 
shows an image of the 3D scene flow for the tango dancer shown in Figure 11.2h-j, while 
Figure 11.18b shows 3D scene flows captured from a moving vehicle for the purpose of 
obstacle avoidance. In addition to supporting mensuration and safety applications, scene 
flow can be used to support both spatial and temporal view interpolation (Section 13.5.4), as 
demonstrated by Vedula, Baker, and Kanade (2005). 

11.6.1 Volumetric and 3D surface reconstruction 

According to Seitz, Curless, Diebel et al. (2006): 

The goal of multi-view stereo is to reconstruct a complete 3D object model from 
a collection of images taken from known camera viewpoints. 

The most challenging but potentially most useful variant of multi-view stereo reconstruc- 
tion is to create globally consistent 3D models. This topic has a long history in computer 
vision, starting with surface mesh reconstruction techniques such as the one developed by 


Figure 11.18 Three-dimensional scene flow: (a) computed from a multi-camera dome sur- 
rounding the dancer shown in Figure 11.2h-j (Vedula, Baker, Rander et al. 2005) © 2005 
IEEE; (b) computed from stereo cameras mounted on a moving vehicle (Wedel, Rabe, Vau- 
drey et al. 2008) © 2008 Springer. 


Fua and Leclerc (1995) (Figure 11.19a). A variety of approaches and representations have 
been used to solve this problem, including 3D voxel representations (Seitz and Dyer 1999; 
Szeliski and Golland 1999; De Bonet and Viola 1999; Kutulakos and Seitz 2000; Eisert, Stein- 
bach, and Girod 2000; Slabaugh, Culbertson, Slabaugh et al. 2004; Sinha and Pollefeys 2005; 
Vogiatzis, Hernandez, Torr et al. 2007; Hiep, Keriven, Pons et al. 2009), level sets (Faugeras 
and Keriven 1998; Pons, Keriven, and Faugeras 2007), polygonal meshes (Fua and Leclerc 
1995; Narayanan, Rander, and Kanade 1998; Hernandez and Schmitt 2004; Furukawa and 
Ponce 2009), and multiple depth maps (Kolmogorov and Zabih 2002). Figure 11.19 shows 
representative examples of 3D object models reconstructed using some of these techniques. 

In order to organize and compare all these techniques, Seitz, Curless, Diebel et al. (2006) 
developed a six-point taxonomy that can help classify algorithms according to the scene rep- 
resentation, photoconsistency measure, visibility model, shape priors, reconstruction algo- 
rithm, and initialization requirements they use. Below, we summarize some of these choices 
and list a few representative papers. For more details, please consult the full survey paper 
(Seitz, Curless, Diebel et al. 2006) and the evaluation Web site, http://vision.middlebury.edu/ 
mview/, which contains pointers to even more recent papers and results. 

Scene representation. One of the more popular 3D representations is a uniform grid of 3D 
voxels, 7 which can be reconstructed using a variety of carving (Seitz and Dyer 1999; Kutu- 
lakos and Seitz 2000) or optimization (Sinha and Pollefeys 2005; Vogiatzis, Hernandez, Torr 
et al. 2007; Hiep, Keriven, Pons et al. 2009) techniques. Level set techniques (Section 5.1.4) 
also operate on a uniform grid but, instead of representing a binary occupancy map, they 
represent the signed distance to the surface (Faugeras and Keriven 1998; Pons, Keriven, and 
Faugeras 2007), which can encode a finer level of detail. Polygonal meshes are another pop- 

7 For outdoor scenes that go to infinity, a non-uniform gridding of space may be preferable (Slabaugh, Culbertson, 
Slabaugh et al. 2004). 
Figure 11.19 Multi-view stereo algorithms: (a) surface-based stereo (Fua and Leclerc 
1995); (b) voxel coloring (Seitz and Dyer 1999) © 1999 Springer; (c) depth map merg- 
ing (Narayanan, Rander, and Kanade 1998); (d) level set evolution (Faugeras and Keriven 
1998) © 1998 IEEE; (e) silhouette and stereo fusion (Hernandez and Schmitt 2004) © 2004 
Elsevier; (f) multi-view image matching (Pons, Keriven, and Faugeras 2005) © 2005 IEEE; 
(g) volumetric graph cut (Vogiatzis, Torr, and Cipolla 2005) © 2005 IEEE; (h) carved visual 
hulls (Furukawa and Ponce 2009) © 2009 Springer. 


ular representation (Fua and Leclerc 1995; Narayanan, Rander, and Kanade 1998; Isidoro 
and Sclaroff 2003; Hernandez and Schmitt 2004; Furukawa and Ponce 2009; Hiep, Keriven, 
Pons et al. 2009). Meshes are the standard representation used in computer graphics and 
also readily support the computation of visibility and occlusions. Finally, as we discussed in 
the previous section, multiple depth maps can also be used (Szeliski 1999; Kolmogorov and 
Zabih 2002; Kang and Szeliski 2004). Many algorithms also use more than a single represen- 
tation, e.g., they may start by computing multiple depth maps and then merge them into a 3D 
object model (Narayanan, Rander, and Kanade 1998; Furukawa and Ponce 2009; Goesele, 
Curless, and Seitz 2006; Goesele, Snavely, Curless et al. 2007; Furukawa, Curless, Seitz et 
al. 2010). 

Photoconsistency measure. As we discussed in Section 11.3.1, a variety of similarity 
measures can be used to compare pixel values in different images, including measures that 
try to discount illumination effects or be less sensitive to outliers. In multi-view stereo, algo- 
rithms have a choice of computing these measures directly on the surface of the model, i.e., in 
scene space, or projecting pixel values from one image (or from a textured model) back into 
another image, i.e., in image space. (The latter corresponds more closely to a Bayesian ap- 
proach, since input images are noisy measurements of the colored 3D model.) The geometry 
of the object, i.e., its distance to each camera and its local surface normal, when available, can 
be used to adjust the matching windows used in the computation to account for foreshortening 
and scale change (Goesele, Snavely, Curless et al. 2007). 

Visibility model. A big advantage that multi-view stereo algorithms have over single-depth-map 
approaches is their ability to reason in a principled manner about visibility and occlusions. 
Techniques that use the current state of the 3D model to predict which surface pixels 
are visible in each image (Kutulakos and Seitz 2000; Faugeras and Keriven 1998; Vogiatzis, 
Hernandez, Torr et al. 2007; Hiep, Keriven, Pons et al. 2009) are classified as using geometric 
visibility models in the taxonomy of Seitz, Curless, Diebel et al. (2006). Techniques that select 
a neighboring subset of images to match are called quasi-geometric (Narayanan, Rander, 
and Kanade 1998; Kang and Szeliski 2004; Hernandez and Schmitt 2004), while techniques 
that use traditional robust similarity measures are called outlier-based. While full geometric 
reasoning is the most principled and accurate approach, it can be very slow to evaluate and 
depends on the evolving quality of the current surface estimate to predict visibility, which can 
be a bit of a chicken-and-egg problem, unless conservative assumptions are used, as they are 
by Kutulakos and Seitz (2000). 

Shape priors. Because stereo matching is often underconstrained, especially in texture- 
less regions, most matching algorithms adopt (either explicitly or implicitly) some form of 
prior model for the expected shape. Many of the techniques that rely on optimization use a 
3D smoothness or area-based photoconsistency constraint, which, because of the natural ten- 
dency of smooth surfaces to shrink inwards, often results in a minimal surface prior (Faugeras 
and Keriven 1998; Sinha and Pollefeys 2005; Vogiatzis, Hernandez, Torr et al. 2007). Ap- 
proaches that carve away the volume of space often stop once a photoconsistent solution is 
found (Seitz and Dyer 1999; Kutulakos and Seitz 2000), which corresponds to a maximal sur- 
face bias, i.e., these techniques tend to over-estimate the true shape. Finally, multiple depth 
map approaches often adopt traditional image-based smoothness (regularization) constraints. 

Reconstruction algorithm. The details of how the actual reconstruction algorithm proceeds 
are where the largest variety, and the greatest innovation, in multi-view stereo algorithms 
can be found. 

Some approaches use global optimization defined over a three-dimensional photoconsis- 
tency volume to recover a complete surface. Approaches based on graph cuts use polynomial 
complexity binary segmentation algorithms to recover the object model defined on the voxel 
grid (Sinha and Pollefeys 2005; Vogiatzis, Hernandez, Torr et al. 2007; Hiep, Keriven, Pons 
et al. 2009). Level set approaches use a continuous surface evolution to find a good minimum 
in the configuration space of potential surfaces and therefore require a reasonably good 
initialization (Faugeras and Keriven 1998; Pons, Keriven, and Faugeras 2007). In order for 
the photoconsistency volume to be meaningful, matching costs need to be computed in some 
robust fashion, e.g., using sets of limited views or by aggregating multiple depth maps. 

An alternative approach to global optimization is to sweep through the 3D volume while 
computing both photoconsistency and visibility simultaneously. The voxel coloring algorithm 
of Seitz and Dyer (1999) performs a front-to-back plane sweep. On every plane, any voxels 
that are sufficiently photoconsistent are labeled as part of the object. The corresponding 
pixels in the source images can then be “erased”, since they are already accounted for, and 
therefore do not contribute to further photoconsistency computations. (A similar approach, 
albeit without the front-to-back sweep order, is used by Szeliski and Golland (1999).) The 
resulting 3D volume, under noise- and resampling-free conditions, is guaranteed both to 
produce a photoconsistent 3D model and to enclose whatever true 3D object model generated the 
images. 
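
The front-to-back sweep described above can be sketched in a toy setting. Everything specific in the code below is an assumption made for illustration: voxels are grouped into layers ordered front to back, images are 1D lists in which None denotes background, project(voxel, k) is a hypothetical user-supplied function returning the pixel index of a voxel in camera k, and photoconsistency is tested by thresholding the range of the sampled colors. It is not the implementation of Seitz and Dyer (1999).

```python
def voxel_coloring(layers, images, project, threshold):
    """Toy front-to-back voxel coloring sweep.

    layers: list of lists of voxels, nearest layer first.
    images[k][p]: color of pixel p in camera k, or None for background.
    project(voxel, k): pixel index of the voxel in camera k, or None.
    Returns a dict mapping each accepted voxel to its mean color.
    """
    marked = [set() for _ in images]     # pixels already explained
    colored = {}
    for layer in layers:
        newly_marked = []
        for voxel in layer:
            samples = []
            hits_background = False
            for k, image in enumerate(images):
                p = project(voxel, k)
                if p is None or not 0 <= p < len(image):
                    continue             # voxel is outside this view
                if p in marked[k]:
                    continue             # occluded by a nearer voxel
                if image[p] is None:
                    hits_background = True
                    break
                samples.append((k, p, image[p]))
            if hits_background or not samples:
                continue
            values = [color for _, _, color in samples]
            if max(values) - min(values) <= threshold:
                colored[voxel] = sum(values) / len(values)
                newly_marked.extend((k, p) for k, p, _ in samples)
        # "Erase" pixels only after the whole layer has been processed.
        for k, p in newly_marked:
            marked[k].add(p)
    return colored
```

Note that in this sketch a voxel seen in only one unmarked pixel is trivially consistent, which is one reason practical implementations require a minimum number of views.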

Unfortunately, voxel coloring is only guaranteed to work if all of the cameras lie on the 
same side of the sweep planes, which is not possible in general ring configurations of cameras. 
Kutulakos and Seitz (2000) generalize voxel coloring to space carving, where subsets of 
cameras that satisfy the voxel coloring constraint are iteratively selected and the 3D voxel 
grid is alternately carved away along different axes. 

Another popular approach to multi-view stereo is to first independently compute multiple 
depth maps and then merge these partial maps into a complete 3D model. Approaches to 
depth map merging, which are discussed in more detail in Section 12.2.1, include signed 
distance functions (Curless and Levoy 1996), used by Goesele, Curless, and Seitz (2006), 
and Poisson surface reconstruction (Kazhdan, Bolitho, and Hoppe 2006), used by Goesele, 
Snavely, Curless et al. (2007). It is also possible to reconstruct sparser representations, such 
as 3D points and lines, and to interpolate them to full 3D surfaces (Section 12.3.1) (Taylor 
2003). 

Initialization requirements. One final element discussed by Seitz, Curless, Diebel et al. 
(2006) is the varying degrees of initialization required by different algorithms. Because some 
algorithms refine or evolve a rough 3D model, they require a reasonably accurate (or over- 
complete) initial model, which can often be obtained by reconstructing a volume from object 

Figure 11.20 The multi-view stereo data sets captured by Seitz, Curless, Diebel et al. (2006) 
© 2006 Springer. Only (a) and (b) are currently used for evaluation. 


silhouettes, as discussed in Section 11.6.2. However, if the algorithm performs a global op- 
timization (Kolev, Klodt, Brox et al. 2009; Kolev and Cremers 2009), this dependence on 
initialization is not an issue. 

Empirical evaluation. In order to evaluate the large number of design alternatives in multi- 
view stereo, Seitz, Curless, Diebel et al. (2006) collected a dataset of calibrated images using 
a spherical gantry. A representative image from each of the six datasets is shown in Figure 11.20, 
although only the first two datasets have as yet been fully processed and used for 
evaluation. Figure 11.21 shows the results of running seven different algorithms on the tem- 
ple dataset. As you can see, most of the techniques do an impressive job of capturing the fine 
details in the columns, although it is also clear that the techniques employ differing amounts 
of smoothing to achieve these results. 

Since the publication of the survey by Seitz, Curless, Diebel et al. (2006), the field of 
multi-view stereo has continued to advance at a rapid pace (Strecha, Fransens, and Van 
Gool 2006; Hernandez, Vogiatzis, and Cipolla 2007; Habbecke and Kobbelt 2007; Furukawa 
and Ponce 2007; Vogiatzis, Hernandez, Torr et al. 2007; Goesele, Snavely, Curless et al. 
2007; Sinha, Mordohai, and Pollefeys 2007; Gargallo, Prados, and Sturm 2007; Merrell, Ak- 
barzadeh, Wang et al. 2007; Zach, Pock, and Bischof 2007b; Furukawa and Ponce 2008; 
Hornung, Zeng, and Kobbelt 2008; Bradley, Boubekeur, and Heidrich 2008; Zach 2008; 
Campbell, Vogiatzis, Hernandez et al. 2008; Kolev, Klodt, Brox et al. 2009; Hiep, Keriven, 
Pons et al. 2009; Furukawa, Curless, Seitz et al. 2010). The multi-view stereo evaluation site, 
http://vision.middlebury.edu/mview/, provides quantitative results for these algorithms along 
with pointers to where to find these papers. 

11.6.2 Shape from silhouettes 

In many situations, performing a foreground-background segmentation of the object of in- 
terest is a good way to initialize or fit a 3D model (Grauman, Shakhnarovich, and Darrell 
Figure 11.21 Reconstruction results (details) for seven algorithms (Hernandez and Schmitt 
2004; Furukawa and Ponce 2009; Pons, Keriven, and Faugeras 2005; Goesele, Curless, and 
Seitz 2006; Vogiatzis, Torr, and Cipolla 2005; Tran and Davis 2002; Kolmogorov and Zabih 
2002) evaluated by Seitz, Curless, Diebel et al. (2006) on the 47-image Temple Ring dataset. 
The numbers underneath each detail image are the accuracy of each of these techniques mea- 
sured in millimeters. 


2003; Vlasic, Baran, Matusik et al. 2008) or to impose a convex set of constraints on multi- 
view stereo (Kolev and Cremers 2008). Over the years, a number of techniques have been 
developed to reconstruct a 3D volumetric model from the intersection of the binary silhou- 
ettes projected into 3D. The resulting model is called a visual hull (or sometimes a line hull), 
analogous with the convex hull of a set of points, since the volume is maximal with respect 
to the visual silhouettes and surface elements are tangent to the viewing rays (lines) along 
the silhouette boundaries (Laurentini 1994). It is also possible to carve away a more accu- 
rate reconstruction using multi-view stereo (Sinha and Pollefeys 2005) or by analyzing cast 
shadows (Savarese, Andreetto, Rushmeier et al. 2007). 
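
On a voxel grid, the intersection of back-projected silhouettes reduces to keeping every voxel that projects inside the silhouette in all views. The following sketch assumes each view is given as a pair (silhouette, project), where silhouette is a set of foreground pixel coordinates and project is a hypothetical function mapping a voxel to a pixel; a real system would use calibrated camera projections and would test whole cells rather than single points.

```python
def visual_hull(voxels, views):
    """Keep the voxels that fall inside every silhouette.

    voxels: iterable of voxel coordinates.
    views: list of (silhouette, project) pairs, where silhouette is a set
           of foreground pixels and project(voxel) returns a pixel.
    """
    return [
        v for v in voxels
        if all(project(v) in silhouette for silhouette, project in views)
    ]
```

For instance, with one orthographic view along the z axis and another along the x axis of a 2 × 2 × 2 grid, only the voxels whose projections lie in both silhouettes survive.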

Some techniques first approximate each silhouette with a polygonal representation and 
then intersect the resulting faceted conical regions in three-space to produce polyhedral mod- 
els (Baumgart 1974; Martin and Aggarwal 1983; Matusik, Buehler, and McMillan 2001), 
which can later be refined using triangular splines (Sullivan and Ponce 1998). Other ap- 
proaches use voxel-based representations, usually encoded as octrees (Samet 1989), because 
of the resulting space-time efficiency. Figures 11.22a-b show an example of a 3D octree 
model and its associated colored tree, where black nodes are interior to the model, white 
nodes are exterior, and gray nodes are of mixed occupancy. Examples of octree-based 
reconstruction approaches include those by Potmesil (1987), Noborio, Fukada, and Arimoto 
(1988), Srivasan, Liang, and Hackwood (1990), and Szeliski (1993). 

Figure 11.22 Volumetric octree reconstruction from binary silhouettes (Szeliski 1993) © 
1993 Elsevier: (a) octree representation and its corresponding (b) tree structure; (c) input 
image of an object on a turntable; (d) computed 3D volumetric octree model. 

The approach of Szeliski (1993) first converts each binary silhouette into a one-sided 
variant of a distance map, where each pixel in the map indicates the largest square that is 
completely inside (or outside) the silhouette. This makes it fast to project an octree cell 
into the silhouette to confirm whether it is completely inside or outside the object, so that 
it can be colored black, white, or left as gray (mixed) for further refinement on a smaller 
grid. The octree construction algorithm proceeds in a coarse-to-fine manner, first building an 
octree at a relatively coarse resolution, and then refining it by subdividing the gray (mixed) 
cells whose occupancy has not yet been determined and revisiting all the input images for them. 
Figure 11.22d shows the resulting octree model computed from a coffee cup rotating on a 
turntable. 
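
The "largest square" map can be sketched with a standard dynamic program. The code below is an illustrative assumption, not the exact encoding used by Szeliski (1993): each pixel stores the side of the largest all-foreground square whose bottom-right corner is that pixel, and the helper names (`largest_square_map`, `classify_footprint`) are hypothetical. A square footprint of a projected cell can then be tested with a single lookup per map.

```python
def largest_square_map(mask):
    """Side length of the largest all-ones square ending (bottom-right) at each pixel."""
    rows, cols = len(mask), len(mask[0])
    size = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if mask[r][c]:
                if r == 0 or c == 0:
                    size[r][c] = 1
                else:
                    size[r][c] = 1 + min(size[r - 1][c], size[r][c - 1],
                                         size[r - 1][c - 1])
    return size

def classify_footprint(inside, outside, r0, c0, side):
    """Classify a side x side footprint with top-left corner (r0, c0) in one view."""
    r1, c1 = r0 + side - 1, c0 + side - 1  # bottom-right corner of the footprint
    if inside[r1][c1] >= side:
        return "inside"
    if outside[r1][c1] >= side:
        return "outside"
    return "mixed"

# Example: a 4 x 4 foreground block inside a 6 x 6 silhouette image.
mask = [[1 if 1 <= r <= 4 and 1 <= c <= 4 else 0 for c in range(6)] for r in range(6)]
inside = largest_square_map(mask)
outside = largest_square_map([[1 - v for v in row] for row in mask])
```

In a multi-view setting, a cell would be colored white as soon as one view reports "outside", black only if every view reports "inside", and gray otherwise.
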

More recent work on visual hull computation borrows ideas from image-based rendering, 
and is hence called an image-based visual hull (Matusik, Buehler, Raskar et al. 2000). Instead 
of precomputing a global 3D model, an image-based visual hull is recomputed for each new 
viewpoint, by successively intersecting viewing ray segments with the binary silhouettes in 
each image. This not only leads to a fast computation algorithm but also enables fast texturing 
of the recovered model with color values from the input images. This approach can also 
be combined with high-quality deformable templates to capture and re-animate whole body 
motion (Vlasic, Baran, Matusik et al. 2008). 
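
The core operation along a single viewing ray can be sketched as an intersection of 1D interval lists. This is a simplified illustration under the assumption that each silhouette has already been converted into a sorted list of disjoint (start, end) depth intervals along the ray; how those intervals are obtained from the silhouette images is not shown, and the function names are hypothetical.

```python
def intersect_intervals(a, b):
    """Intersect two sorted lists of disjoint (start, end) intervals."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            result.append((lo, hi))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result

def ray_visual_hull(per_view_intervals):
    """Successively intersect the ray with the interval list from each view."""
    segments = [(0.0, float("inf"))]  # start with the whole ray
    for intervals in per_view_intervals:
        segments = intersect_intervals(segments, intervals)
    return segments

segments = ray_visual_hull([
    [(1.0, 4.0), (6.0, 9.0)],   # view 1
    [(2.0, 7.0)],               # view 2
    [(0.0, 3.0), (6.5, 10.0)],  # view 3
])
```

The first surviving segment along each ray gives the visible surface point for the new viewpoint.
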


11.7 Additional reading 

The field of stereo correspondence and depth estimation is one of the oldest and most widely 
studied topics in computer vision. A number of good surveys have been written over the years 
(Marr and Poggio 1976; Barnard and Fischler 1982; Dhond and Aggarwal 1989; Scharstein 
and Szeliski 2002; Brown, Burschka, and Hager 2003; Seitz, Curless, Diebel et al. 2006) and 
they can serve as good guides to this extensive literature. 

Because of computational limitations and the desire to find appearance-invariant cor- 
respondences, early algorithms often focused on finding sparse correspondences (Hannah 
1974; Marr and Poggio 1979; Mayhew and Frisby 1980; Baker and Binford 1981; Arnold 
1983; Grimson 1985; Ohta and Kanade 1985; Bolles, Baker, and Marimont 1987; Matthies, 
Kanade, and Szeliski 1989; Hsieh, McKeown, and Perlant 1992; Bolles, Baker, and Hannah 
1993). 

The topic of computing epipolar geometry and pre-rectifying images is covered in Sections 
7.2 and 11.1 and is also treated in textbooks on multi-view geometry (Faugeras and 
Luong 2001; Hartley and Zisserman 2004) and articles specifically on this topic (Torr and 
Murray 1997; Zhang 1998a,b). The concepts of the disparity space and disparity space im- 
age are often associated with the seminal work by Marr (1982) and the papers of Yang, Yuille, 
and Lu (1993) and Intille and Bobick (1994). The plane sweep algorithm was first popular- 
ized by Collins (1996) and then generalized to a full arbitrary projective setting by Szeliski 
and Golland (1999) and Saito and Kanade (1999). Plane sweeps can also be formulated using 
cylindrical surfaces (Ishiguro, Yamamoto, and Tsuji 1992; Kang and Szeliski 1997; Shum 
and Szeliski 1999; Li, Shum, Tang et al. 2004; Zheng, Kang, Cohen et al. 2007) or even more 
general topologies (Seitz 2001). 

Once the topology for the cost volume or DSI has been set up, we need to compute the 
actual photoconsistency measures for each pixel and potential depth. A wide range of such 
measures have been proposed, as discussed in Section 11.3.1. Some of these are compared in 
recent surveys and evaluations of matching costs (Scharstein and Szeliski 2002; Hirschmüller 
and Scharstein 2009). 

To compute an actual depth map from these costs, some form of optimization or selection 
criterion must be used. The simplest of these are sliding windows of various kinds, which 
are discussed in Section 11.4 and surveyed by Gong, Yang, Wang et al. (2007) and Tombari, 
Mattoccia, Di Stefano et al. (2008). More commonly, global optimization frameworks are 
used to compute the best disparity field, as described in Section 11.5. These techniques 
include dynamic programming and truly global optimization algorithms, such as graph cuts 
and loopy belief propagation. Because the literature on this is so extensive, it is described in 
more detail in Section 11.5. A good place to find pointers to the latest results in this field is 
the Middlebury Stereo Vision Page at http://vision.middlebury.edu/stereo. 

Algorithms for multi-view stereo typically fall into two categories. The first includes 
algorithms that compute traditional depth maps using several images for computing 
photoconsistency measures (Okutomi and Kanade 1993; Kang, Webb, Zitnick et al. 1995; Nakamura, 
Matsuura, Satoh et al. 1996; Szeliski and Golland 1999; Kang, Szeliski, and Chai 2001; 
Vaish, Szeliski, Zitnick et al. 2006; Gallup, Frahm, Mordohai et al. 2008). Optionally, some 
of these techniques compute multiple depth maps and use additional constraints to encourage 
the different depth maps to be consistent (Szeliski 1999; Kolmogorov and Zabih 2002; Kang 
and Szeliski 2004; Maitre, Shinagawa, and Do 2008; Zhang, Jia, Wong et al. 2008). 

The second category consists of papers that compute true 3D volumetric or surface-based 
object models. Again, because of the large number of papers published on this topic, rather 
than citing them here, we refer you to the material in Section 11.6.1, the survey by Seitz, 
Curless, Diebel et al. (2006), and the on-line evaluation Web site at 
http://vision.middlebury.edu/mview/. 


11.8 Exercises 

Ex 11.1: Stereo pair rectification Implement the following simple algorithm (Section 11.1.1): 

1. Rotate both cameras so that they are looking perpendicular to the line joining the two 
camera centers C_0 and C_1. The smallest rotation can be computed from the cross product 
between the original and desired optical axes. 

2. Twist the optical axes so that the horizontal axis of each camera looks in the direction 
of the other camera. (Again, the cross product between the current x-axis after the first 
rotation and the line joining the cameras gives the rotation.) 

3. If needed, scale up the smaller (less detailed) image so that it has the same resolution 
(and hence line-to-line correspondence) as the other image. 

Now compare your results to the algorithm proposed by Loop and Zhang (1999). Can you 
think of situations where their approach may be preferable? 
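
As a starting point for step 1, the smallest rotation between two unit vectors can be built from their cross product using Rodrigues' formula. The sketch below is only one possible reading of the step: it assumes the desired optical axis is the original axis with its component along the baseline removed, and all function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(a):
    length = math.sqrt(dot(a, a))
    return tuple(x / length for x in a)

def rotation_between(a, b):
    """Smallest rotation matrix taking unit vector a to unit vector b."""
    k = cross(a, b)  # rotation axis scaled by sin(angle)
    c = dot(a, b)    # cos(angle)
    if c < -1.0 + 1e-12:
        raise ValueError("opposite vectors: the rotation axis is ambiguous")
    K = [[0.0, -k[2], k[1]], [k[2], 0.0, -k[0]], [-k[1], k[0], 0.0]]
    f = 1.0 / (1.0 + c)  # equals (1 - cos) / sin^2
    return [[(1.0 if i == j else 0.0) + K[i][j]
             + f * sum(K[i][m] * K[m][j] for m in range(3))
             for j in range(3)] for i in range(3)]

def apply_rotation(R, v):
    return tuple(dot(row, v) for row in R)

def rectifying_rotation(optical_axis, baseline):
    """Rotate the optical axis so that it is perpendicular to the baseline."""
    z = normalize(optical_axis)
    b = normalize(baseline)
    along = dot(z, b)
    desired = normalize(tuple(zc - along * bc for zc, bc in zip(z, b)))
    return rotation_between(z, desired), desired

R, desired = rectifying_rotation((0.1, 0.0, 1.0), (1.0, 0.0, 0.0))
rotated = apply_rotation(R, normalize((0.1, 0.0, 1.0)))
```

Step 2 (the twist about the new optical axis) can reuse `rotation_between` with the current x-axis and the baseline direction.
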

Ex 11.2: Rigid direct alignment Modify your spline-based or optical flow motion estimator 
from Exercise 8.4 to use epipolar geometry, i.e., to only estimate disparity. 

(Optional) Extend your algorithm to simultaneously estimate the epipolar geometry (with- 
out first using point correspondences) by estimating a base homography corresponding to a 
reference plane for the dominant motion and then an epipole for the residual parallax (mo- 
tion). 

Ex 11.3: Shape from profiles Reconstruct a surface model from a series of edge images 
(Section 11.2.1). 

1. Extract edges and link them (Exercises 4.7-4.8). 

2. Based on previously computed epipolar geometry, match up edges in triplets (or longer 
sets) of images. 

3. Reconstruct the 3D locations of the curves using osculating circles (11.5). 

4. Render the resulting 3D surface model as a sparse mesh, i.e., drawing the reconstructed 
3D profile curves and links between 3D points in neighboring images with similar 
osculating circles. 

Ex 11.4: Plane sweep Implement a plane sweep algorithm (Section 11.1.2). 

If the images are already pre-rectified, this consists simply of shifting images relative to 
each other and comparing pixels. If the images are not pre-rectified, compute the homography 
that resamples the target image into the reference image’s coordinate system for each plane. 

Evaluate a subset of the following similarity measures (Section 11.3.1) and compare their 
performance by visualizing the disparity space image (DSI), which should be dark for pixels 
at correct depths: 

• squared difference (SD); 

• absolute difference (AD); 

• truncated or robust measures; 

• gradient differences; 

• rank or census transform (the latter usually performs better); 

• mutual information from a pre-computed joint density function. 

Consider using the Birchfield and Tomasi (1998) technique of comparing ranges between 
neighboring pixels (different shifted or warped images). Also, try pre-compensating images 
for bias or gain variations using one or more of the techniques discussed in Section 11.3.1. 
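
For the pre-rectified case, the DSI can be sketched in a few lines. This is a minimal illustration, not a full solution: it assumes grayscale images stored as nested lists, a convention in which reference pixel (x, y) matches target pixel (x - d, y), and hypothetical function names. Swapping the `cost` argument lets you compare the pixel-wise measures listed above.

```python
def compute_dsi(ref, tgt, max_disp, cost):
    """Build a disparity space image dsi[d][y][x] for a rectified pair.

    Positions whose match would fall outside the target image keep an
    infinite cost so that they are never selected.
    """
    h, w = len(ref), len(ref[0])
    dsi = [[[float("inf")] * w for _ in range(h)] for _ in range(max_disp + 1)]
    for d in range(max_disp + 1):
        for y in range(h):
            for x in range(d, w):
                dsi[d][y][x] = cost(ref[y][x], tgt[y][x - d])
    return dsi

def squared_difference(a, b):
    return (a - b) ** 2

def absolute_difference(a, b):
    return abs(a - b)

def truncated_difference(a, b, limit=20):
    # A simple robust measure: large differences are clamped to `limit`.
    return min(abs(a - b), limit)

# Example: the reference scanline is the target scanline shifted right by 2 pixels.
tgt = [[10, 50, 90, 30, 70, 20, 60, 40]]
ref = [[0, 0, 10, 50, 90, 30, 70, 20]]
dsi_sd = compute_dsi(ref, tgt, 3, squared_difference)
dsi_ad = compute_dsi(ref, tgt, 3, absolute_difference)
dsi_tr = compute_dsi(ref, tgt, 3, truncated_difference)
```

In this example the slice `dsi_sd[2]` is zero (dark) wherever the shifted pixels exist, which is the behavior the visualization should reveal at the correct disparity.
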

Ex 11.5: Aggregation and window-based stereo Implement one or more of the matching 
cost aggregation strategies described in Section 11.4: 

• convolution with a box or Gaussian kernel; 

• shifting window locations by applying a min filter (Scharstein and Szeliski 2002); 

• picking a window that maximizes some match-reliability metric (Veksler 2001, 2003); 

• weighting pixels by their similarity to the central pixel (Yoon and Kweon 2006). 

Once you have aggregated the costs in the DSI, pick the winner at each pixel (winner-take- 
all), and then optionally perform one or more of the following post-processing steps: 

1. compute matches both ways and pick only the reliable matches (draw the others in 
another color); 

2. tag matches that are unsure (whose confidence is too low); 

3. fill in the matches that are unsure from neighboring values; 

4. refine your matches to sub-pixel disparity by either fitting a parabola to the DSI values 
around the winner or by using an iteration of Lucas-Kanade. 
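
The simplest pipeline (box aggregation, winner-take-all, and parabola-based sub-pixel refinement) can be sketched for a single scanline as follows. This is an illustrative sketch with hypothetical function names; it assumes the costs are stored as `dsi[d][x]` and uses windows clipped at the image borders.

```python
def box_aggregate(dsi, radius):
    """Average costs over a (2 * radius + 1)-wide window, clipped at the borders."""
    aggregated = []
    for row in dsi:
        out = []
        for x in range(len(row)):
            window = row[max(0, x - radius): x + radius + 1]
            out.append(sum(window) / len(window))
        aggregated.append(out)
    return aggregated

def winner_take_all(dsi):
    """Pick the lowest-cost disparity independently at each pixel."""
    num_d, w = len(dsi), len(dsi[0])
    return [min(range(num_d), key=lambda d: dsi[d][x]) for x in range(w)]

def subpixel_offset(c_minus, c_zero, c_plus):
    """Offset of the minimum of the parabola through costs at d-1, d, d+1."""
    curvature = c_minus - 2.0 * c_zero + c_plus
    if curvature <= 0.0:
        return 0.0  # flat or non-convex neighborhood: keep the integer winner
    return (c_minus - c_plus) / (2.0 * curvature)

# Example: the true disparity is 1 everywhere, with one noisy cost at x = 2.
dsi = [[4, 4, 4, 4, 4],
       [0, 0, 9, 0, 0],
       [5, 5, 5, 5, 5]]
raw_winners = winner_take_all(dsi)
smooth_winners = winner_take_all(box_aggregate(dsi, 1))
```

Without aggregation the noisy pixel picks the wrong disparity; after aggregation all pixels agree on disparity 1.
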

Ex 11.6: Optimization-based stereo Compute the disparity space image (DSI) volume using 
one of the techniques you implemented in Exercise 11.4 and then implement one (or more) 
of the global optimization techniques described in Section 11.5 to compute the depth map. 
Potential choices include: 

• dynamic programming or scanline optimization (relatively easy); 

• semi-global optimization (Hirschmüller 2008), which is a simple extension of scanline 
optimization and performs well; 

• graph cuts using alpha expansions (Boykov, Veksler, and Zabih 2001), for which you 
will need to find a max-flow or min-cut algorithm (http://vision.middlebury.edu/stereo); 

• loopy belief propagation (Appendix B.5.3). 

Evaluate your algorithm by running it on the Middlebury stereo data sets. 

How well does your algorithm do against local aggregation (Yoon and Kweon 2006)? 
Can you think of some extensions or modifications to make it even better? 
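
Scanline optimization, the easiest of these choices, can be sketched as a Viterbi-style dynamic program over one scanline. The sketch below assumes costs stored as `dsi[d][x]` and a linear smoothness penalty `smooth_weight * |d - d_prev|`; both the penalty and the function name are illustrative assumptions rather than a prescribed formulation.

```python
def scanline_optimize(dsi, smooth_weight):
    """Minimize sum_x dsi[d_x][x] + smooth_weight * |d_x - d_{x-1}| along one scanline."""
    num_d, w = len(dsi), len(dsi[0])
    total = [dsi[d][0] for d in range(num_d)]
    back = []  # back[x - 1][d] = best previous disparity for disparity d at pixel x
    for x in range(1, w):
        new_total, pointers = [], []
        for d in range(num_d):
            best_prev = min(range(num_d),
                            key=lambda p: total[p] + smooth_weight * abs(d - p))
            new_total.append(dsi[d][x] + total[best_prev]
                             + smooth_weight * abs(d - best_prev))
            pointers.append(best_prev)
        total = new_total
        back.append(pointers)
    # Trace the minimum-cost path backwards from the last pixel.
    d = min(range(num_d), key=lambda k: total[k])
    path = [d]
    for pointers in reversed(back):
        d = pointers[d]
        path.append(d)
    path.reverse()
    return path

dsi = [[4, 4, 4, 4, 4],
       [0, 0, 9, 0, 0],
       [5, 5, 5, 5, 5]]
smooth_path = scanline_optimize(dsi, 3)
independent_path = scanline_optimize(dsi, 0)
```

With a zero smoothness weight the result reduces to winner-take-all; with a sufficiently large weight the isolated outlier is overruled by its neighbors.
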

Ex 11.7: View interpolation, revisited Compute a dense depth map using one of the techniques 
you developed above and use it (or, better yet, a depth map for each source image) to 
generate smooth in-between views from a stereo data set. 

Compare your results against using the ground truth depth data (if available). 

What kinds of artifacts do you see? Can you think of ways to reduce them? 

More details on implementing such algorithms can be found in Section 13.1 and Exercises 
13.1-13.4. 


Ex 11.8: Multi-frame stereo Extend one of your previous techniques to use multiple input 
frames (Section 11.6) and try to improve the results you obtained with just two views. 

If helpful, try using temporal selection (Kang and Szeliski 2004) to deal with the increased 
number of occlusions in multi-frame data sets. 

You can also try to simultaneously estimate multiple depth maps and make them consis- 
tent (Kolmogorov and Zabih 2002; Kang and Szeliski 2004). 

Test your algorithms out on some standard multi-view data sets. 

Ex 11.9: Volumetric stereo Implement voxel coloring (Seitz and Dyer 1999) as a simple 
extension to the plane sweep algorithm you implemented in Exercise 11.4. 

1. Instead of computing the complete DSI all at once, evaluate each plane one at a time 
from front to back. 

2. Tag every voxel whose photoconsistency is below a certain threshold as being part of 
the object and remember its average (or robust) color (Seitz and Dyer 1999; Eisert, 
Steinbach, and Girod 2000; Kutulakos 2000; Slabaugh, Culbertson, Slabaugh et al. 
2004). 

3. Erase the input pixels corresponding to tagged voxels in the input images, e.g., by 
setting their alpha value to 0 (or to some reduced number, depending on occupancy). 

4. As you evaluate the next plane, use the source image alpha values to modify your 
photoconsistency score, e.g., only consider pixels that have full alpha or weight pixels 
by their alpha values. 

5. If the cameras are not all on the same side of your plane sweeps, use space carving 
(Kutulakos and Seitz 2000) to cycle through different subsets of source images while 
carving away the volume from different directions. 

Ex 11.10: Depth map merging Use the technique you developed for multi-frame stereo in 
Exercise 11.8 or a different technique, such as the one described by Goesele, Snavely, Curless 
et al. (2007), to compute a depth map for every input image. 

Merge these depth maps into a coherent 3D model, e.g., using Poisson surface reconstruction 
(Kazhdan, Bolitho, and Hoppe 2006). 

Ex 11.11: Shape from silhouettes Build a silhouette-based volume reconstruction algorithm 
(Section 11.6.2). Use an octree or some other representation of your choosing. 

Figure 12.1 3D shape acquisition and modeling techniques: (a) shaded image (Zhang, Tsai, 
Cryer et al. 1999) © 1999 IEEE; (b) texture gradient (Garding 1992) © 1992 Springer; (c) 
real-time depth from focus (Nayar, Watanabe, and Noguchi 1996) © 1996 IEEE; (d) scanning 
a scene with a stick shadow (Bouguet and Perona 1999) © 1999 Springer; (e) merging range 
maps into a 3D model (Curless and Levoy 1996) © 1996 ACM; (f) point-based surface 
modeling (Pauly, Keiser, Kobbelt et al. 2003) © 2003 ACM; (g) automated modeling of a 
3D building using lines and planes (Werner and Zisserman 2002) © 2002 Springer; (h) 3D 
face model from spacetime stereo (Zhang, Snavely, Curless et al. 2004) © 2004 ACM; (i) 
person tracking (Sigal, Bhatia, Roth et al. 2004) © 2004 IEEE. 







12 3D reconstruction 




As we saw in the previous chapter, a variety of stereo matching techniques have been de- 
veloped to reconstruct high quality 3D models from two or more images. However, stereo 
is just one of the many potential cues that can be used to infer shape from images. In this 
chapter, we investigate a number of such techniques, which include not only visual cues such 
as shading and focus, but also techniques for merging multiple range or depth images into 3D 
models, as well as techniques for reconstructing specialized models, such as heads, bodies, 
or architecture. 

Among the various cues that can be used to infer shape, the shading on a surface (Fig- 
ure 12.1a) can provide a lot of information about local surface orientations and hence overall 
surface shape (Section 12.1.1). This approach becomes even more powerful when lights 
shining from different directions can be turned on and off separately (photometric stereo). 
Texture gradients (Figure 12.1b), i.e., the foreshortening of regular patterns as the surface 
slants or bends away from the camera, can provide similar cues on local surface orientation 
(Section 12.1.2). Focus is another powerful cue to scene depth, especially when two or more 
images with different focus settings are used (Section 12.1.3). 

3D shape can also be estimated using active illumination techniques such as light stripes 
(Figure 12.1d) or time-of-flight range finders (Section 12.2). The partial surface models 
obtained using such techniques (or passive image-based stereo) can then be merged into more 
coherent 3D surface models (Figure 12.1e), as discussed in Section 12.2.1. Such techniques 
have been used to construct highly detailed and accurate models of cultural heritage such as 
historic sites (Section 12.2.2). The resulting surface models can then be simplified to support 
viewing at different resolutions and streaming across the Web (Section 12.3.2). An alternative 
to working with continuous surfaces is to represent 3D surfaces as dense collections of 3D 
oriented points (Section 12.4) or as volumetric primitives (Section 12.5). 

3D modeling can be more efficient and effective if we know something about the objects 
we are trying to reconstruct. In Section 12.6, we look at three specialized but commonly 
occurring examples, namely architecture (Figure 12.1g), heads and faces (Figure 12.1h), and 
whole bodies (Figure 12.1i). In addition to modeling people, we also discuss techniques for 
tracking them. 

The last stage of shape and appearance modeling is to extract some textures to paint onto 
our 3D models (Section 12.7). Some techniques go beyond this and actually estimate full 
BRDFs (Section 12.7.1). 

Because there exists such a large variety of techniques to perform 3D modeling, this 
chapter does not go into detail on any one of these. Readers are encouraged to find more 
information in the cited references or more specialized publications and conferences de- 
voted to these topics, e.g., the International Symposium on 3D Data Processing, Visualiza- 
tion, and Transmission (3DPVT), the International Conference on 3D Digital Imaging and 
Modeling (3DIM), the International Conference on Automatic Face and Gesture Recognition 
(FG), the IEEE Workshop on Analysis and Modeling of Faces and Gestures, and the International 
Workshop on Tracking Humans for the Evaluation of their Motion in Image Sequences 
(THEMIS). 

12.1 Shape from X 

In addition to binocular disparity, shading, texture, and focus all play a role in how we per- 
ceive shape. The study of how shape can be inferred from such cues is sometimes called 
shape from X, since the individual instances are called shape from shading, shape from 
texture, and shape from focus.¹ In this section, we look at these three cues and how they can 
be used to reconstruct 3D geometry. A good overview of all these topics can be found in the 
collection of papers on physics-based shape inference edited by Wolff, Shafer, and Healey 
(1992b). 

12.1.1 Shape from shading and photometric stereo 

When you look at images of smooth shaded objects, such as the ones shown in Figure 12.2, 
you can clearly see the shape of the object from just the shading variation. How is this 
possible? The answer is that as the surface normal changes across the object, the apparent 
brightness changes as a function of the angle between the local surface orientation and the 
incident illumination, as shown in Figure 2.15 (Section 2.2.2). 

The problem of recovering the shape of a surface from this intensity variation is known as 
shape from shading and is one of the classic problems in computer vision (Horn 1975). The 
collection of papers edited by Horn and Brooks (1989) is a great source of information on 
this topic, especially the chapter on variational approaches. The survey by Zhang, Tsai, Cryer 
et al. (1999) not only reviews more recent techniques, but also provides some comparative 
results. 

Most shape from shading algorithms assume that the surface under consideration is of a 
uniform albedo and reflectance, and that the light source directions are either known or can 
be calibrated by the use of a reference object. Under the assumptions of distant light sources 
and observer, the variation in intensity (irradiance equation) becomes purely a function of the 
local surface orientation, 

$$I(x, y) = R(p(x, y), q(x, y)), \tag{12.1}$$

where $(p, q) = (z_x, z_y)$ are the depth map derivatives and $R(p, q)$ is called the reflectance 
map. For example, a diffuse (Lambertian) surface has a reflectance map that is the (non-negative) 
dot product (2.88) between the surface normal $\hat{n} = (p, q, 1)/\sqrt{1 + p^2 + q^2}$ and 
the light source direction $v = (v_x, v_y, v_z)$, 

$$R(p, q) = \max\left(0, \rho \frac{p v_x + q v_y + v_z}{\sqrt{1 + p^2 + q^2}}\right), \tag{12.2}$$

where $\rho$ is the surface reflectance factor (albedo). 

¹ We have already seen examples of shape from stereo, shape from profiles, and shape from silhouettes in 
Chapter 11. 

Figure 12.2 Synthetic shape from shading (Zhang, Tsai, Cryer et al. 1999) © 1999 IEEE: 
shaded images, (a-b) with light from in front (0, 0, 1) and (c-d) with light from the front right 
(1, 0, 1); (e-f) corresponding shape from shading reconstructions using the technique of Tsai 
and Shah (1994). 
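
The Lambertian reflectance map (12.2) is straightforward to evaluate. The short sketch below assumes that the light direction is given as a unit vector; the function name `reflectance_map` and the default albedo are illustrative choices.

```python
import math

def reflectance_map(p, q, light, albedo=1.0):
    """Lambertian reflectance map R(p, q) of Equation (12.2).

    (p, q) are the depth map derivatives, `light` is the (assumed unit-length)
    light source direction (v_x, v_y, v_z), and `albedo` is the reflectance factor.
    """
    vx, vy, vz = light
    shading = (p * vx + q * vy + vz) / math.sqrt(1.0 + p * p + q * q)
    # Surfaces facing away from the light receive no illumination.
    return max(0.0, albedo * shading)
```

For example, a fronto-parallel patch (p = q = 0) lit from the front has brightness equal to its albedo, while the same patch lit from behind has zero brightness.
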

In principle, Equations (12.1-12.2) can be used to estimate (p, q) using non-linear least 
squares or some other method. Unfortunately, unless additional constraints are imposed, there 
are more unknowns per pixel (p, q) than there are measurements (I). One commonly used 
constraint is the smoothness constraint, 


$$\mathcal{E}_s = \int p_x^2 + p_y^2 + q_x^2 + q_y^2 \, dx \, dy = \int \|\nabla p\|^2 + \|\nabla q\|^2 \, dx \, dy, \tag{12.3}$$

which we already saw in Section 3.7.1 (3.94). The other is the integrability constraint, 

$$\mathcal{E}_i = \int (p_y - q_x)^2 \, dx \, dy, \tag{12.4}$$

which arises naturally, since for a valid depth map $z(x, y)$ with $(p, q) = (z_x, z_y)$, we have 
$p_y = z_{xy} = z_{yx} = q_x$. 
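
A discrete version of the integrability constraint (12.4) can be sketched with forward differences. The grid layout (x increasing along rows, y along columns of the nested lists) and the function names are assumptions made for this illustration. Gradients computed from an actual depth map give zero energy, while an arbitrary (p, q) field generally does not.

```python
def gradients(z):
    """Forward-difference gradients p = z_x and q = z_y of a depth map z[y][x]."""
    h, w = len(z), len(z[0])
    p = [[z[y][x + 1] - z[y][x] for x in range(w - 1)] for y in range(h)]
    q = [[z[y + 1][x] - z[y][x] for x in range(w)] for y in range(h - 1)]
    return p, q

def integrability_energy(p, q):
    """Discrete sum of (p_y - q_x)^2 over the grid."""
    h, w = len(p), len(q[0])
    energy = 0.0
    for y in range(h - 1):
        for x in range(w - 1):
            p_y = p[y + 1][x] - p[y][x]
            q_x = q[y][x + 1] - q[y][x]
            energy += (p_y - q_x) ** 2
    return energy

# Gradients of a real depth map are integrable.
z = [[x * x * y + 3 * y for x in range(5)] for y in range(5)]
p, q = gradients(z)
valid_energy = integrability_energy(p, q)

# A "rotational" field (p, q) = (y, -x) is not the gradient of any surface.
p_bad = [[y for _ in range(4)] for y in range(5)]
q_bad = [[-x for x in range(5)] for _ in range(4)]
invalid_energy = integrability_energy(p_bad, q_bad)
```
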

Instead of first recovering the orientation fields (p, q) and integrating them to obtain a 
surface, it is also possible to directly minimize the discrepancy in the image formation equa- 
tion (12.1) while finding the optimal depth map z(x,y) (Horn 1990). Unfortunately, shape 
from shading is susceptible to local minima in the search space and, like other variational 
problems that involve the simultaneous estimation of many variables, can also suffer from 
slow convergence. Using multi-resolution techniques (Szeliski 1991a) can help accelerate 
the convergence, while using more sophisticated optimization techniques (Dupuis and Olien- 
sis 1994) can help avoid local minima. 

In practice, surfaces other than plaster casts are rarely of a single uniform albedo. Shape 
from shading therefore needs to be combined with some other technique or extended in some 
way to make it useful. One way to do this is to combine it with stereo matching (Fua and 
Leclerc 1995) or known texture (surface patterns) (White and Forsyth 2006). The stereo and 
texture components provide information in textured regions, while shape from shading helps 
fill in the information across uniformly colored regions and also provides finer information 
about surface shape. 

Photometric stereo. Another way to make shape from shading more reliable is to use mul- 
tiple light sources that can be selectively turned on and off. This technique is called photo- 
metric stereo, since the light sources play a role analogous to the cameras located at different 
locations in traditional stereo (Woodham 1981).² For each light source, we have a different 
reflectance map, $R_1(p, q)$, $R_2(p, q)$, etc. Given the corresponding intensities $I_1$, $I_2$, etc. 
at a pixel, we can in principle recover both an unknown albedo $\rho$ and a surface orientation 
estimate $(p, q)$. 

For diffuse surfaces (12.2), if we parameterize the local orientation by $\hat{n}$, we get (for 
non-shadowed pixels) a set of linear equations of the form 

$$I_k = \rho \hat{n} \cdot v_k, \tag{12.5}$$

from which we can recover $\rho \hat{n}$ using linear least squares. These equations are well 
conditioned as long as the (three or more) vectors $v_k$ are linearly independent, i.e., they are 
not along the same azimuth (direction away from the viewer). 
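
The least squares solution of (12.5) at a single pixel can be sketched by solving the 3 x 3 normal equations. The code below is a minimal illustration that assumes a non-shadowed Lambertian pixel and known light directions; it uses Cramer's rule for brevity, and the function names are hypothetical.

```python
import math

def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(A, b):
    """Solve the 3 x 3 system A g = b with Cramer's rule."""
    d = det3(A)
    if abs(d) < 1e-12:
        raise ValueError("light directions are not linearly independent")
    solution = []
    for i in range(3):
        Ai = [row[:] for row in A]
        for r in range(3):
            Ai[r][i] = b[r]
        solution.append(det3(Ai) / d)
    return solution

def photometric_stereo(lights, intensities):
    """Recover albedo and unit normal from I_k = albedo * dot(normal, v_k)."""
    # Normal equations (V^T V) g = V^T I, where g = albedo * normal.
    A = [[sum(v[i] * v[j] for v in lights) for j in range(3)] for i in range(3)]
    b = [sum(v[i] * I for v, I in zip(lights, intensities)) for i in range(3)]
    g = solve3(A, b)
    albedo = math.sqrt(sum(c * c for c in g))
    normal = [c / albedo for c in g]
    return albedo, normal

# Synthetic test: render intensities for a known normal and albedo, then recover them.
true_normal = [c / math.sqrt(0.2 ** 2 + 0.3 ** 2 + 1.0) for c in (0.2, -0.3, 1.0)]
lights = [(0.0, 0.0, 1.0), (0.6, 0.0, 0.8), (0.0, 0.6, 0.8), (-0.6, 0.0, 0.8)]
intensities = [0.8 * sum(n * v for n, v in zip(true_normal, light)) for light in lights]
albedo, normal = photometric_stereo(lights, intensities)
```

When all light directions lie in a common plane through the origin, the normal equations become singular, which is the degenerate configuration the conditioning remark above warns about.
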

Once the surface normals or gradients have been recovered at each pixel, they can be 
integrated into a depth map using a variant of regularized surface fitting (3.100). (Nehab, 
Rusinkiewicz, Davis et al. (2005) and Harker and O’Leary (2008) have produced some recent 
work in this area.) 

² An alternative to turning lights on and off is to use three colored lights (Woodham 1994; Hernandez, Vogiatzis, 
Brostow et al. 2007; Hernandez and Vogiatzis 2010). 

Figure 12.3 Synthetic shape from texture (Garding 1992) © 1992 Springer: (a) regular 
texture wrapped onto a curved surface and (b) the corresponding surface normal estimates. 
Shape from mirror reflections (Savarese, Chen, and Perona 2005) © 2005 Springer: (c) a 
regular pattern reflecting off a curved mirror gives rise to (d) curved lines, from which 3D 
point locations and normals can be inferred. 


When surfaces are specular, more than three light directions may be required. In fact, 
the irradiance equation given in (12.1) not only requires that the light sources and camera be 
distant from the surface, it also neglects inter-reflections, which can be a significant source 
of the shading observed on object surfaces, e.g., the darkening seen inside concave structures 
such as grooves and crevices (Nayar, Ikeuchi, and Kanade 1991). 

12.1.2 Shape from texture 

The variation in foreshortening observed in regular textures can also provide useful informa- 
tion about local surface orientation. Figure 12.3 shows an example of such a pattern, along 
with the estimated local surface orientations. Shape from texture algorithms require a number 
of processing steps, including the extraction of repeated patterns or the measurement of local 
frequencies in order to compute local affine deformations, and a subsequent stage to infer lo- 
cal surface orientation. Details on these various stages can be found in the research literature 
(Witkin 1981; Ikeuchi 1981; Blostein and Ahuja 1987; Garding 1992; Malik and Rosenholtz 
1997; Lobay and Forsyth 2006). 

When the original pattern is regular, it is possible to fit a regular but slightly deformed 
grid to the image and use this grid for a variety of image replacement or analysis tasks (Liu, 
Collins, and Tsin 2004; Liu, Lin, and Hays 2004; Hays, Leordeanu, Efros et al. 2006; Lin, 
Hays, Wu et al. 2006; Park, Brocklehurst, Collins et al. 2009). This process becomes even 
easier if specially printed textured cloth patterns are used (White and Forsyth 2006; White, 
Crane, and Forsyth 2007). 

The deformations induced in a regular pattern when it is viewed in the reflection of a 
curved mirror, as shown in Figure 12.3c-d, can be used to recover the shape of the surface 





Computer Vision: Algorithms and Applications (September 3, 2010 draft) 





Figure 12.4 Real time depth from defocus (Nayar, Watanabe, and Noguchi 1996) © 1996 
IEEE: (a) the real-time focus range sensor, which includes a half-silvered mirror between the 
two telecentric lenses (lower right), a prism that splits the image into two CCD sensors (lower 
left), and an edged checkerboard pattern illuminated by a Xenon lamp (top); (b-c) input video 
frames from the two cameras along with (d) the corresponding depth map; (e-f) two frames 
(you can see the texture if you zoom in) and (g) the corresponding 3D mesh model. 


(Savarese, Chen, and Perona 2005; Rozenfeld, Shimshoni, and Lindenbaum 2007). It is 
also possible to infer local shape information from specular flow, i.e., the motion of specu- 
larities when viewed from a moving camera (Oren and Nayar 1997; Zisserman, Giblin, and 
Blake 1989; Swaminathan, Kang, Szeliski et al. 2002). 


12.1.3 Shape from focus 

A strong cue for object depth is the amount of blur, which increases as the object’s surface 
moves away from the camera’s focusing distance. As shown in Figure 2.19, moving the object 
surface away from the focus plane increases the circle of confusion, according to a formula 
that is easy to establish using similar triangles (Exercise 2.4). 
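
The similar-triangles argument can be sketched numerically with the thin lens equation. The parameterization below (object distance, focus distance, focal length, and aperture diameter, all in the same units) is an assumption made for this illustration and is not necessarily the notation of Exercise 2.4.

```python
def circle_of_confusion(z_obj, z_focus, focal_length, aperture):
    """Blur circle diameter for a thin lens, from similar triangles.

    The sensor sits where objects at distance `z_focus` are in focus. Light
    from an object at `z_obj` converges at distance `z_img` behind the lens,
    so the cone of light of base diameter `aperture` has diameter
    aperture * |sensor - z_img| / z_img where it crosses the sensor plane.
    """
    z_img = focal_length * z_obj / (z_obj - focal_length)       # thin lens equation
    sensor = focal_length * z_focus / (z_focus - focal_length)  # sensor distance
    return aperture * abs(sensor - z_img) / z_img

# A 50 mm lens with a 25 mm aperture focused at 2 m (all lengths in mm).
blur_in_focus = circle_of_confusion(2000.0, 2000.0, 50.0, 25.0)
blur_near = circle_of_confusion(1000.0, 2000.0, 50.0, 25.0)
blur_far = circle_of_confusion(4000.0, 2000.0, 50.0, 25.0)
```

The blur is zero at the focus distance and grows for both nearer and farther objects, so a single blur measurement cannot tell the two sides of the focus plane apart.
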

A number of techniques have been developed to estimate depth from the amount of de- 
focus (depth from defocus) (Pentland 1987; Nayar and Nakagawa 1994; Nayar, Watanabe, 
and Noguchi 1996; Watanabe and Nayar 1998; Chaudhuri and Rajagopalan 1999; Favaro 
and Soatto 2006). In order to make such a technique practical, a number of issues need to be 
addressed: 

• The amount of blur increases in both directions as you move away from the focus plane. 
Therefore, it is necessary to use two or more images captured with different focus 
distance settings (Pentland 1987; Nayar, Watanabe, and Noguchi 1996) or to translate 
the object in depth and look for the point of maximum sharpness (Nayar and Nakagawa 
1994). 

• The magnification of the object can vary as the focus distance is changed or the object is 
moved. This can be modeled either explicitly (making correspondence more difficult) 
or using telecentric optics, which approximate an orthographic camera and require an 
aperture in front of the lens (Nayar, Watanabe, and Noguchi 1996). 

• The amount of defocus must be reliably estimated. A simple approach is to average the 
squared gradient in a region but this suffers from several problems, including the image 
magnification problem mentioned above. A better solution is to use carefully designed 
rational filters (Watanabe and Nayar 1998). 

Figure 12.5 Range data scanning (Curless and Levoy 1996) © 1996 ACM: (a) a laser dot 
on a surface is imaged by a CCD sensor; (b) a laser stripe (sheet) is imaged by the sensor (the 
deformation of the stripe encodes the distance to the object); (c) the resulting set of 3D points 
are turned into (d) a triangulated mesh. 
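
The simple defocus estimate mentioned in the last item, the average squared gradient over a region, can be sketched as follows. This is only the baseline measure, not the rational filters of Watanabe and Nayar (1998); the forward-difference gradient and the small box blur used to simulate defocus are assumptions made for the illustration.

```python
def focus_measure(image):
    """Average squared forward-difference gradient over an image region."""
    h, w = len(image), len(image[0])
    total, count = 0.0, 0
    for y in range(h - 1):
        for x in range(w - 1):
            gx = image[y][x + 1] - image[y][x]
            gy = image[y + 1][x] - image[y][x]
            total += gx * gx + gy * gy
            count += 1
    return total / count

def box_blur_rows(image):
    """3-tap horizontal box blur with edge replication (simulates mild defocus)."""
    blurred = []
    for row in image:
        padded = [row[0]] + list(row) + [row[-1]]
        blurred.append([(padded[i] + padded[i + 1] + padded[i + 2]) / 3.0
                        for i in range(len(row))])
    return blurred

sharp = [[0, 0, 0, 100, 100, 100] for _ in range(4)]
soft = box_blur_rows(sharp)
sharp_score = focus_measure(sharp)
soft_score = focus_measure(soft)
```

The sharp step edge scores higher than its blurred version, so the measure can rank focus settings for a region that contains texture.
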

Figure 12.4 shows an example of a real-time depth from defocus sensor, which employs 
two imaging chips at slightly different depths sharing a common optical path, as well as an 
active illumination system that projects a checkerboard pattern from the same direction. As 
you can see in Figure 12.4b-g, the system produces high-accuracy real-time depth maps for 
both static and dynamic scenes. 

12.2 Active rangefinding 

As we have seen in the previous section, actively lighting a scene, whether for the purpose 
of estimating normals using photometric stereo or for adding artificial texture for shape from 
defocus, can greatly improve the performance of vision systems. This kind of active illu- 
mination has been used from the earliest days of machine vision to construct highly reliable 







Figure 12.6 Shape scanning using cast shadows (Bouguet and Perona 1999) © 1999 
Springer: (a) camera setup with a point light source (a desk lamp without its reflector), a 
hand-held stick casting a shadow, and (b) the objects being scanned in front of two planar 
backgrounds. (c) Real-time depth map using a pulsed illumination system (Iddan and Yahav 
2001) © 2001 SPIE. 


sensors for estimating 3D depth images using a variety of rangefinding (or range sensing) 
techniques (Besl 1989; Curless 1999; Hebert 2000). 

One of the most popular active illumination sensors is a laser or light stripe sensor, which 
sweeps a plane of light across the scene or object while observing it from an offset viewpoint, 
as shown in Figure 12.5b (Rioux and Bird 1993; Curless and Levoy 1995). As the stripe falls 
across the object, it deforms its shape according to the shape of the surface it is illuminating. 
It is then a simple matter of using optical triangulation to estimate the 3D locations of all the 
points seen in a particular stripe. In more detail, knowledge of the 3D plane equation of the 
light stripe allows us to infer the 3D location corresponding to each illuminated pixel, as pre- 
viously discussed in (2.70-2.71). The accuracy of light striping techniques can be improved 
by finding the exact temporal peak in illumination for each pixel (Curless and Levoy 1995). 
The final accuracy of a scanner can be determined using slant edge modulation techniques, 
i.e., by imaging sharp creases in a calibration object (Goesele, Fuchs, and Seidel 2003). 
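
The triangulation step for a single illuminated pixel can be sketched as a ray-plane intersection. This is only a minimal illustration, assuming an ideal pinhole camera at the origin with focal length `f` and principal point `(cx, cy)` in pixels, and a calibrated light plane written as n . X = d in camera coordinates. The function name and all parameters are hypothetical and not taken from any of the cited systems.

```python
def triangulate_stripe_pixel(u, v, f, cx, cy, plane_n, plane_d):
    """Intersect the viewing ray through pixel (u, v) with the light plane n . X = d."""
    # Direction of the viewing ray for a pinhole camera centered at the origin.
    ray = ((u - cx) / f, (v - cy) / f, 1.0)
    denom = sum(n * r for n, r in zip(plane_n, ray))
    if abs(denom) < 1e-12:
        raise ValueError("viewing ray is parallel to the light plane")
    t = plane_d / denom  # the 3D point X = t * ray satisfies n . X = d
    return tuple(t * r for r in ray)


# Hypothetical example: light plane x = 0.5, pixel 100 columns right of the center.
point = triangulate_stripe_pixel(u=420.0, v=240.0, f=500.0, cx=320.0, cy=240.0,
                                 plane_n=(1.0, 0.0, 0.0), plane_d=0.5)
```

Repeating this for every pixel lit by the stripe, with the plane equation updated as the stripe sweeps, yields one 3D profile per stripe position.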

An interesting variant on light stripe rangefinding is presented by Bouguet and Perona 
(1999). Instead of projecting a light stripe, they simply wave a stick casting a shadow over a 
scene or object illuminated by a point light source such as a lamp or the sun (Figure 12.6a). 
As the shadow falls across two background planes whose orientation relative to the camera 
is known (or inferred during pre-calibration), the plane equation for each stripe can be 
inferred from the two projected lines, whose 3D equations are known (Figure 12.6b). The 
deformation of the shadow as it crosses the object being scanned then reveals its 3D shape, 
as with regular light stripe rangefinding (Exercise 12.2). This technique can also be used to 
estimate the 3D geometry of a background scene and how its appearance varies as it moves 
into shadow, in order to cast new shadows onto the scene (Chuang, Goldman, Curless et al. 
2003) (Section 10.4.3). 

The time it takes to scan an object using a light stripe technique is proportional to the 
number of depth planes used, which is usually comparable to the number of pixels across 
an image. A much faster scanner can be constructed by turning different projector pixels on 
and off in a structured manner, e.g., using a binary or Gray code (Besl 1989). For example, 
let us assume that the LCD projector we are using has 1024 columns of pixels. Taking the 
10-bit binary code corresponding to each column’s address (0 . . . 1023), we project the first 
bit, then the second, etc. After 10 projections (e.g., a third of a second for a synchronized 
30Hz camera-projector system), each pixel in the camera knows which of the 1024 columns 
of projector light it is seeing. A similar approach can also be used to estimate the refractive 
properties of an object by placing a monitor behind the object (Zongker, Werner, Curless et al. 
1999; Chuang, Zongker, Hindorff et al. 2000) (Section 13.4). Very fast scanners can also be 
constructed with a single laser beam, i.e., a real-time flying spot optical triangulation scanner 
(Rioux, Bechthold, Taylor et al. 1987). 
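
The column coding scheme can be sketched as follows. This is a minimal illustration that uses a Gray code (one of the two options mentioned above) for a hypothetical 1024-column projector. It assumes that each camera pixel has already been thresholded into clean on/off observations, which a real system would have to do robustly. All function names are made up for the example.

```python
def gray_encode(n):
    """Binary-reflected Gray code of the integer n."""
    return n ^ (n >> 1)


def gray_decode(g):
    """Invert gray_encode by XOR-ing together all right shifts of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def column_patterns(num_columns=1024, num_bits=10):
    """patterns[b][c] is 1 if projector column c is lit in projection b (MSB first)."""
    return [[(gray_encode(c) >> (num_bits - 1 - b)) & 1 for c in range(num_columns)]
            for b in range(num_bits)]


def decode_pixel(bits):
    """Recover the projector column from the on/off values seen at one camera pixel."""
    g = 0
    for bit in bits:
        g = (g << 1) | bit
    return gray_decode(g)


patterns = column_patterns()
observed = [patterns[b][613] for b in range(10)]  # what a pixel lit by column 613 sees
column = decode_pixel(observed)
```

A Gray code is often preferred over a plain binary code because adjacent columns differ in exactly one bit, so a thresholding error at a stripe boundary shifts the decoded column by at most one.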

If even faster, i.e., frame-rate, scanning is required, we can project a single textured pat- 
tern into the scene. Proesmans, Van Gool, and Defoort (1998) describe a system where a 
checkerboard grid is projected onto an object (e.g., a person’s face) and the deformation of 
the grid is used to infer 3D shape. Unfortunately, such a technique only works if the surface 
is continuous enough to link all of the grid points. 

A much better system can be constructed using high-speed custom illumination and sens- 
ing hardware. Iddan and Yahav (2001) describe the construction of their 3DV Zcam video- 
rate depth sensing camera, which projects a pulsed plane of light onto the scene and then 
integrates the returning light for a short interval, essentially obtaining a time-of-flight 
measurement of the distance to individual pixels in the scene. A good description of earlier 
time-of-flight systems, including amplitude and frequency modulation schemes for LIDAR, 
can be found in (Besl 1989). 
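
As a back-of-the-envelope illustration of the time-of-flight principle (the standard relation between distance and round-trip travel time, not a description of the internal processing of any particular sensor), the distance $d$ follows from the round-trip delay $\Delta t$ of the light pulse:

```latex
d = \frac{c \, \Delta t}{2}, \qquad c \approx 3 \times 10^{8}\ \text{m/s},
\qquad \Delta t = 1\ \text{ns} \;\Rightarrow\; d \approx 0.15\ \text{m}.
```

Resolving depth to about a centimeter therefore corresponds to timing (or gating) differences of roughly 67 picoseconds, which suggests why such sensors rely on custom high-speed illumination and sensing hardware.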

Instead of using a single camera, it is also possible to construct an active illumination 
range sensor using stereo imaging setups. The simplest way to do this is to just project ran- 
dom stripe patterns onto the scene to create synthetic texture, which helps match textureless 
surfaces (Kang, Webb, Zitnick et al. 1995). Projecting a known series of stripes, just as in 
coded pattern single-camera rangefinding, makes the correspondence between pixels unam- 
biguous and allows for the recovery of depth estimates at pixels only seen in a single camera 
(Scharstein and Szeliski 2003). This technique has been used to produce large numbers of 
highly accurate registered multi-image stereo pairs and depth maps for the purpose of evaluating 
stereo correspondence algorithms (Scharstein and Szeliski 2002; Hirschmüller and 
Scharstein 2009) and learning depth map priors and parameters (Scharstein and Pal 2007). 

While projecting multiple patterns usually requires the scene or object to remain still, ad- 
ditional processing can enable the production of real-time depth maps for dynamic scenes. 
Figure 12.7 Real-time dense 3D face capture using spacetime stereo (Zhang, Snavely, Curless 
et al. 2004) © 2004 ACM: (a) set of five consecutive video frames from one of two stereo 
cameras (every fifth frame is free of stripe patterns, in order to extract texture); (b) resulting 
high-quality 3D surface model (depth map visualized as a shaded rendering). 


The basic idea (Davis, Ramamoorthi, and Rusinkiewicz 2003; Zhang, Curless, and Seitz 
2003) is to assume that depth is nearly constant within a 3D space-time window around 
each pixel and to use the 3D window for matching and reconstruction. Depending on the 
surface shape and motion, this assumption may be error-prone, as shown in (Davis, Nahab, 
Ramamoorthi et al. 2005). To model shapes more accurately, Zhang, Curless, and Seitz 
(2003) model the linear disparity variation within the space-time window and show that bet- 
ter results can be obtained by globally optimizing disparity and disparity gradient estimates 
over video volumes (Zhang, Snavely, Curless et al. 2004). Figure 12.7 shows the results of 
applying this system to a person’s face; the frame-rate 3D surface model can then be used for 
further model-based fitting and computer graphics manipulation (Section 12.6.2). 
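
The basic constant-disparity version of space-time matching can be sketched as a sum of squared differences accumulated over a 3D (time, row, column) window. This is a minimal sketch under simplifying assumptions: rectified video volumes stored as NumPy arrays of shape `(T, H, W)`, integer disparities, and a brute-force search. It does not model the linear disparity variation or the global optimization described above, and the function name is hypothetical.

```python
import numpy as np


def spacetime_disparity(left, right, t, y, x, max_disp, half_xy=2, half_t=1):
    """Return the integer disparity minimizing SSD over a space-time window."""
    win_l = left[t - half_t:t + half_t + 1,
                 y - half_xy:y + half_xy + 1,
                 x - half_xy:x + half_xy + 1]
    best_d, best_cost = 0, np.inf
    for d in range(max_disp + 1):
        xr = x - d  # matching column in the right video
        if xr - half_xy < 0:
            break  # window would leave the image
        win_r = right[t - half_t:t + half_t + 1,
                      y - half_xy:y + half_xy + 1,
                      xr - half_xy:xr + half_xy + 1]
        cost = float(np.sum((win_l - win_r) ** 2))
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


# Synthetic check: the right video is the left one shifted by 3 pixels.
rng = np.random.default_rng(0)
left = rng.random((5, 20, 40))
right = np.roll(left, -3, axis=2)
d = spacetime_disparity(left, right, t=2, y=10, x=20, max_disp=8)
```

Setting `half_t=0` reduces this to ordinary spatial window matching, which makes it easy to compare the two on the same data.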

12.2.1 Range data merging 

While individual range images can be useful for applications such as real-time z-keying or fa- 
cial motion capture, they are often used as building blocks for more complete 3D object mod- 
eling. In such applications, the next two steps in processing are the registration (alignment) of 
partial 3D surface models and their integration into coherent 3D surfaces (Curless 1999). If 
desired, this can be followed by a model fitting stage using either parametric representations 
such as generalized cylinders (Agin and Binford 1976; Nevatia and Binford 1977; Marr and 
Nishihara 1978; Brooks 1981), superquadrics (Pentland 1986; Solina and Bajcsy 1990; Ter- 
zopoulos and Metaxas 1991), or non-parametric models such as triangular meshes (Boissonat 
1984) or physically-based models (Terzopoulos, Witkin, and Kass 1988; Delingette, Hebert, 
and Ikeuchi 1992; Terzopoulos and Metaxas 1991; McInerney and Terzopoulos 1993; 
Terzopoulos 1999). A number of techniques have also been developed for segmenting range 
images into simpler constituent surfaces (Hoover, Jean-Baptiste, Jiang et al. 1996). 

The most widely used 3D registration technique is the iterated closest point (ICP) algorithm, 
which alternates between finding the closest point matches between the two surfaces 
being aligned and then solving a 3D absolute orientation problem (Section 6.1.5, (6.31-6.32)) 
(Besl and McKay 1992; Chen and Medioni 1992; Zhang 1994; Szeliski and Lavallee 1996; 
Gold, Rangarajan, Lu et al. 1998; David, DeMenthon, Duraiswami et al. 2004; Li and Hart- 
ley 2007; Enqvist, Josephson, and Kahl 2009). 3 Since the two surfaces being aligned usually 
only have partial overlap and may also have outliers, robust matching criteria (Section 6.1.4 
and Appendix B.3) are typically used. In order to speed up the determination of the closest 
point, and also to make the distance-to-surface computation more accurate, one of the two 
point sets (e.g., the current merged model) can be converted into a signed distance function, 
optionally represented using an octree spline for compactness (Lavallee and Szeliski 1995). 
Variants on the basic ICP algorithm can be used to register 3D point sets under non-rigid de- 
formations, e.g., for medical applications (Feldmar and Ayache 1996; Szeliski and Lavallee 
1996). Color values associated with the points or range measurements can also be used as 
part of the registration process to improve robustness (Johnson and Kang 1997; Pulli 1999). 
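
The alternation at the heart of ICP can be sketched in a few lines. This is a minimal point-to-point sketch, assuming NumPy, brute-force closest-point search, and an SVD-based solution of the absolute orientation step. It omits the robust matching criteria, distance-function acceleration, and tangent-plane refinements discussed above, and the function names are hypothetical.

```python
import numpy as np


def best_rigid_transform(src, dst):
    """Least-squares R, t such that R @ src[i] + t is close to dst[i]."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)  # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection in the recovered orthogonal matrix.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, dst_c - R @ src_c


def icp(src, dst, iterations=20):
    """Align src to dst by alternating closest-point matching and rigid fitting."""
    R_total, t_total = np.eye(3), np.zeros(3)
    cur = src.copy()
    for _ in range(iterations):
        # Brute-force closest point in dst for every current source point.
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(axis=2)
        matches = dst[d2.argmin(axis=1)]
        R, t = best_rigid_transform(cur, matches)
        cur = cur @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total


# Synthetic check: a 4x4x4 grid of points moved by a small known rigid motion.
axis = (-1.5, -0.5, 0.5, 1.5)
grid = np.array([[i, j, k] for i in axis for j in axis for k in axis])
angle = 0.05
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.05, -0.03, 0.04])
target = grid @ R_true.T + t_true
R_est, t_est = icp(grid, target)
```

The example uses a small motion on purpose: with a larger initial misalignment the closest-point matches become wrong and the iteration can settle into a poor local minimum.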

Unfortunately, the ICP algorithm and its variants can only find a locally optimal alignment 
between 3D surfaces. If an approximate initial alignment is not known a priori, more global correspondence or search 
techniques, based on local descriptors invariant to 3D rigid transformations, need to be used. 
An example of such a descriptor is the spin image, which is a local circular projection of a 
3D surface patch around the local normal axis (Johnson and Hebert 1999). Another (earlier) 
example is the splash representation introduced by Stein and Medioni (1992). 

Once two or more 3D surfaces have been aligned, they can be merged into a single model. 
One approach is to represent each surface using a triangulated mesh and combine these 
meshes using a process that is sometimes called zippering (Soucy and Laurendeau 1992; 
Turk and Levoy 1994). Another, now more widely used, approach is to compute a signed 
distance function that fits all of the 3D data points (Hoppe, DeRose, Duchamp et al. 1992; 
Curless and Levoy 1996; Hilton, Stoddart, Illingworth et al. 1996; Wheeler, Sato, and Ikeuchi 
1998). 

Figure 12.8 shows one such approach, the volumetric range image processing (VRIP) 
technique developed by Curless and Levoy (1996), which first computes a weighted signed 
distance function from each range image and then merges them using a weighted averaging 
process. To make the representation more compact, run-length coding is used to encode 
the empty, seen, and varying (signed distance) voxels, and only the signed distance values 
near each surface are stored. 4 Once the merged signed distance function has been computed, 
a zero-crossing surface extraction algorithm, such as marching cubes (Lorensen and Cline 
1987), can be used to recover a meshed surface model. Figure 12.9 shows an example of the 

3 Some techniques, such as the one developed by Chen and Medioni (1992), use local surface tangent planes to 
make this computation more accurate and to accelerate convergence. 

4 An alternative, even more compact, representation could be to use octrees (Lavallee and Szeliski 1995). 
Figure 12.8 Range data merging (Curless and Levoy 1996) © 1996 ACM: (a) two signed 
distance functions (top left) are merged with their weights (bottom left) to produce a combined 
set of functions (right column) from which an isosurface can be extracted (green dashed 
line); (b) the signed distance functions are combined with empty and unseen space labels to 
fill holes in the isosurface. 


complete range data merging and isosurface extraction pipeline. 
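
The weighted averaging step can be illustrated on a single row of voxels. This is a minimal sketch of a running weighted average of signed distance samples followed by a linear zero-crossing search. It assumes the per-scan distances and weights have already been resampled onto a common voxel row, it ignores the empty/unseen labels and run-length coding described above, and the names are hypothetical.

```python
def merge_signed_distances(distances, weights):
    """Per-voxel running weighted average of signed distances from several scans."""
    num_voxels = len(distances[0])
    D = [0.0] * num_voxels  # merged signed distance
    W = [0.0] * num_voxels  # accumulated weight
    for d_i, w_i in zip(distances, weights):
        for v in range(num_voxels):
            if w_i[v] > 0.0:
                D[v] = (W[v] * D[v] + w_i[v] * d_i[v]) / (W[v] + w_i[v])
                W[v] += w_i[v]
    merged = [D[v] if W[v] > 0.0 else None for v in range(num_voxels)]
    return merged, W


def zero_crossing(xs, D):
    """Linearly interpolated position of the first sign change along a voxel row."""
    for i in range(len(xs) - 1):
        a, b = D[i], D[i + 1]
        if a is not None and b is not None and a * b <= 0.0 and a != b:
            return xs[i] + (xs[i + 1] - xs[i]) * a / (a - b)
    return None


# Two scans observe the surface at x = 2.0 and x = 2.4; the second is trusted 3x more.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
scan1 = [x - 2.0 for x in xs]
scan2 = [x - 2.4 for x in xs]
merged, total_weight = merge_signed_distances([scan1, scan2],
                                              [[1.0] * 5, [3.0] * 5])
surface_x = zero_crossing(xs, merged)
```

In this toy case the merged zero crossing lands at the weighted mean of the two observed surface positions, which is the behavior the weighted averaging is meant to produce.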

Volumetric range data merging techniques based on signed distance or characteristic 
(inside-outside) functions are also widely used to extract smooth well-behaved surfaces from 
oriented or unoriented sets of points (Hoppe, DeRose, Duchamp et al. 1992; Ohtake, Belyaev, 
Alexa et al. 2003; Kazhdan, Bolitho, and Hoppe 2006; Lempitsky and Boykov 2007; Zach, 
Pock, and Bischof 2007b; Zach 2008), as discussed in more detail in Section 12.5.1. 

12.2.2 Application: Digital heritage 

Active rangefinding technologies, combined with surface modeling and appearance model- 
ing techniques (Section 12.7), are widely used in the fields of archeological and historical 
preservation, which often also goes under the name digital heritage (MacDonald 2006). In 
such applications, detailed 3D models of cultural objects are acquired and later used for ap- 
plications such as analysis, preservation, restoration, and the production of duplicate artwork 
(Rioux and Bird 1993). 

A more recent example of such an endeavor is the Digital Michelangelo project of Levoy, 
Pulli, Curless et al. (2000), which used Cyberware laser stripe scanners and high-quality 
digital SLR cameras mounted on a large gantry to obtain detailed scans of Michelangelo’s 
David and other sculptures in Florence. The project also took scans of the Forma Urbis 
Romae, an ancient stone map of Rome that had shattered into pieces, for which new matches 
were obtained using digital techniques. The whole process, from initial planning, to software 


Figure 12.9 Reconstruction and hardcopy of the “Happy Buddha” statuette (Curless and 
Levoy 1996) © 1996 ACM: (a) photograph of the original statue after spray painting with 
matte gray; (b) partial range scan; (c) merged range scans; (d) colored rendering of the recon- 
structed model; (e) hardcopy of the model constructed using stereolithography. 


development, acquisition, and post-processing, took several years (and many volunteers), and 
produced a wealth of 3D shape and appearance modeling techniques as a result. 

Even larger-scale projects are now being attempted, for example, the scanning of com- 
plete temple sites such as Angkor-Thom (Ikeuchi and Sato 2001; Ikeuchi and Miyazaki 2007; 
Banno, Masuda, Oishi et al. 2008). Figure 12.10 shows details from this project, including a 
sample photograph, a detailed 3D (sculptural) head model scanned from ground level, and an 
aerial overview of the final merged 3D site model, which was acquired using a balloon. 


12.3 Surface representations 

In previous sections, we have seen different representations being used to integrate 3D range 
scans. We now look at several of these representations in more detail. Explicit surface 
representations, such as triangle meshes, splines (Farin 1992, 1996), and subdivision surfaces 
(Stollnitz, DeRose, and Salesin 1996; Zorin, Schröder, and Sweldens 1996; Warren and 
Weimer 2001; Peters and Reif 2008), enable not only the creation of highly detailed models 
but also processing operations, such as interpolation (Section 12.3.1), fairing or smoothing, 
and decimation and simplification (Section 12.3.2). We also examine discrete point-based 
representations (Section 12.4) and volumetric representations (Section 12.5). 

Figure 12.10 Laser range modeling of the Bayon temple at Angkor-Thom (Banno, Masuda, 
Oishi et al. 2008) © 2008 Springer: (a) sample photograph from the site; (b) a detailed head 
model scanned from the ground; (c) final merged 3D model of the temple scanned using a 
laser range sensor mounted on a balloon. 

12.3.1 Surface interpolation 

One of the most common operations on surfaces is their reconstruction from a set of sparse 
data constraints, i.e., scattered data interpolation. When formulating such problems, surfaces 
may be parameterized as height fields $f(\mathbf{x})$, as 3D parametric surfaces $\mathbf{f}(\mathbf{x})$, or as 
non-parametric models such as collections of triangles. 

In the section on image processing, we saw how two-dimensional function interpolation 
and approximation problems $\{d_i\} \rightarrow f(\mathbf{x})$ could be cast as energy minimization problems 
using regularization (Section 3.7.1, (3.94-3.98)). 5 Such problems can also specify the locations 
of discontinuities in the surface as well as local orientation constraints (Terzopoulos 1986b; 
Zhang, Dugas-Phocion, Samson et al. 2002). 

One approach to solving such problems is to discretize both the surface and the energy 
on a discrete grid or mesh using finite element analysis (3.100-3.102) (Terzopoulos 1986b). 
Such problems can then be solved using sparse system solving techniques, such as multigrid 
(Briggs, Henson, and McCormick 2000) or hierarchically preconditioned conjugate gradient 
(Szeliski 2006b). The surface can also be represented using a hierarchical combination of 
multilevel B-splines (Lee, Wolberg, and Shin 1996). 

An alternative approach is to use radial basis (or kernel) functions (Boult and Kender 
1986; Nielson 1993). To interpolate a field $f(\mathbf{x})$ through (or near) a number of data values 
$d_i$ located at $\mathbf{x}_i$, the radial basis function approach uses 

$$f(\mathbf{x}) = \frac{\sum_i w_i(\mathbf{x})\, d_i}{\sum_i w_i(\mathbf{x})}, \tag{12.6}$$


5 The difference between interpolation and approximation is that the former requires the surface or function to 
pass through the data while the latter allows the function to pass near the data, and can therefore be used for surface 
smoothing as well. 
where the weights, 

$$w_i(\mathbf{x}) = K(\|\mathbf{x} - \mathbf{x}_i\|), \tag{12.7}$$

are computed using a radial basis (spherically symmetrical) function K(r). 

If we want the function $f(\mathbf{x})$ to exactly interpolate the data points, the kernel functions 
must either be singular at the origin, $\lim_{r \to 0} K(r) \to \infty$ (Nielson 1993), or a dense linear 
system must be solved to determine the magnitude associated with each basis function (Boult 
and Kender 1986). It turns out that, for certain regularized problems, e.g., (3.94-3.96), there 
exist radial basis functions (kernels) that give the same results as a full analytical solution 
(Boult and Kender 1986). Unfortunately, because the dense system solving is cubic in the 
number of data points, basis function approaches can only be used for small problems such 
as feature-based image morphing (Beier and Neely 1992). 
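
The singular-kernel option can be sketched with inverse-distance weights. This is a minimal illustration, assuming the normalized weighted-average form f(x) = sum_i w_i(x) d_i / sum_i w_i(x) with the hypothetical kernel choice K(r) = r^(-p), which is singular at the origin and therefore interpolates the data without solving a linear system. The function name and the default exponent are assumptions made for the example.

```python
import math


def shepard_interpolate(x, data_points, data_values, power=2.0):
    """Evaluate the normalized inverse-distance weighted interpolant at x."""
    num, den = 0.0, 0.0
    for x_i, d_i in zip(data_points, data_values):
        r = math.dist(x, x_i)
        if r == 0.0:
            return d_i  # the kernel is singular here, so the data value dominates
        w = r ** -power  # w_i(x) = K(||x - x_i||) with K(r) = r^(-power)
        num += w * d_i
        den += w
    return num / den


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
values = [1.0, 3.0, 5.0]
at_data = shepard_interpolate((1.0, 0.0), points, values)
midway = shepard_interpolate((0.5, 0.0), points[:2], values[:2])
inside = shepard_interpolate((0.3, 0.3), points, values)
```

Each evaluation costs time linear in the number of data points and needs no system solve, in contrast to the dense-system alternative whose cubic cost is noted above.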

When a three-dimensional parametric surface is being modeled, the vector-valued function 
$\mathbf{f}$ in (12.6) or (3.94-3.102) encodes 3D coordinates $(x, y, z)$ on the surface and the 
domain $\mathbf{x} = (s, t)$ encodes the surface parameterization. One example of such surfaces is the class of 
symmetry-seeking parametric models, which are elastically deformable versions of generalized 
cylinders 6 (Terzopoulos, Witkin, and Kass 1987). In these models, s is the parameter 
along the spine of the deformable tube and t is the parameter around the tube. A variety of 
smoothness and radial symmetry forces are used to constrain the model while it is fitted to 
image-based silhouette curves. 

It is also possible to define non-parametric surface models such as general triangulated 
meshes and to equip such meshes (using finite element analysis) with both internal smoothness 
metrics and external data fitting metrics (Sander and Zucker 1990; Fua and Sander 1992; 
Delingette, Hebert, and Ikeuchi 1992; McInerney and Terzopoulos 1993). While most of 
these approaches assume a standard elastic deformation model, which uses quadratic internal 
smoothness terms, it is also possible to use sub-linear energy models in order to better 
preserve surface creases (Diebel, Thrun, and Brünig 2006). Triangle meshes can also be augmented 
with either spline elements (Sullivan and Ponce 1998) or subdivision surfaces (Stollnitz, 
DeRose, and Salesin 1996; Zorin, Schröder, and Sweldens 1996; Warren and Weimer 
2001; Peters and Reif 2008) to produce surfaces with better smoothness control. 

Both parametric and non-parametric surface models assume that the topology of the sur- 
face is known and fixed ahead of time. For more flexible surface modeling, we can either rep- 
resent the surface as a collection of oriented points (Section 12.4) or use 3D implicit functions 
(Section 12.5.1), which can also be combined with elastic 3D surface models (McInerney and 
Terzopoulos 1993). 

6 A generalized cylinder (Brooks 1981) is a solid of revolution, i.e., the result of rotating a (usually smooth) curve 
around an axis. It can also be generated by sweeping a slowly varying circular cross-section along the axis. (These 
two interpretations are equivalent.) 

Figure 12.11 Progressive mesh representation of an airplane model (Hoppe 1996) © 1996 
ACM: (a) base mesh $M^0$ (150 faces); (b) mesh $M^{175}$ (500 faces); (c) mesh $M^{425}$ (1000 
faces); (d) original mesh $M = M^n$ (13,546 faces). 


12.3.2 Surface simplification 

Once a triangle mesh has been created from 3D data, it is often desirable to create a hierarchy 
of mesh models, for example, to control the displayed level of detail (LOD) in a computer 
graphics application. (In essence, this is a 3D analog to image pyramids (Section 3.5).) One 
approach to doing this is to approximate a given mesh with one that has subdivision connectivity, 
over which a set of triangular wavelet coefficients can then be computed (Eck, DeRose, 
Duchamp et al. 1995). A more continuous approach is to use sequential edge collapse operations 
to go from the original fine-resolution mesh to a coarse base-level mesh (Hoppe 1996). 
The resulting progressive mesh (PM) representation can be used to render the 3D model at 
arbitrary levels of detail, as shown in Figure 12.11. 
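
The edge collapse operation itself can be sketched on an indexed triangle list. This is a minimal illustration only: it always collapses the shortest edge to its midpoint, which is a simple stand-in and not the energy-based ordering used to build actual progressive meshes. It also does not check for duplicate faces or topological validity. The function names and the record layout are hypothetical.

```python
import math


def collapse_shortest_edge(vertices, faces):
    """Collapse the shortest edge (u, v) into u, moved to the edge midpoint."""
    edges = sorted({tuple(sorted((f[i], f[(i + 1) % 3])))
                    for f in faces for i in range(3)})
    u, v = min(edges, key=lambda e: math.dist(vertices[e[0]], vertices[e[1]]))
    # Keep enough information to undo the collapse later (a vertex split).
    record = (u, v, vertices[u], vertices[v])
    vertices[u] = tuple((a + b) / 2.0 for a, b in zip(vertices[u], vertices[v]))
    new_faces = []
    for f in faces:
        g = tuple(u if i == v else i for i in f)  # redirect v to u
        if len(set(g)) == 3:  # drop faces that became degenerate
            new_faces.append(g)
    return new_faces, record


def simplify(vertices, faces, target_faces):
    """Collapse edges until at most target_faces remain; returns the base mesh."""
    vertices = list(vertices)
    records = []
    while len(faces) > target_faces:
        faces, record = collapse_shortest_edge(vertices, faces)
        records.append(record)
    return vertices, faces, records


# A square split into a fan of four triangles around an off-center vertex.
verts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0),
         (0.0, 1.0, 0.0), (0.2, 0.2, 0.0)]
tris = [(0, 1, 4), (1, 2, 4), (2, 3, 4), (3, 0, 4)]
base_verts, base_tris, records = simplify(verts, tris, target_faces=2)
```

Replaying `records` in reverse order restores detail one vertex at a time, which is the idea behind rendering at a continuum of levels of detail.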


12.3.3 Geometry images 

While multi-resolution surface representations such as (Eck, DeRose, Duchamp et al. 1995; 
Hoppe 1996) support level of detail operations, they still consist of an irregular collection of 
triangles, which makes them more difficult to compress and store in a cache-efficient manner. 7 

To make the triangulation completely regular (uniform and gridded), Gu, Gortler, and 
Hoppe (2002) describe how to create geometry images by cutting surface meshes along well- 
chosen lines and “flattening” the resulting representation into a square. Figure 12.12a shows 
the resulting $(x, y, z)$ values of the surface mesh mapped over the unit square, while Figure 12.12b 
shows the associated $(n_x, n_y, n_z)$ normal map, i.e., the surface normals associated 
with each mesh vertex, which can be used to compensate for loss in visual fidelity if the 
original geometry image is heavily compressed. 

7 Subdivision triangulations, such as those in (Eck, DeRose, Duchamp et al. 1995), are semi-regular, i.e., regular 
(ordered and nested) within each subdivided base triangle. 

Figure 12.12 Geometry images (Gu, Gortler, and Hoppe 2002) © 2002 ACM: (a) the 257 × 
257 geometry image defines a mesh over the surface; (b) the 512 × 512 normal map defines 
vertex normals; (c) final lit 3D model. 

12.4 Point-based representations 

As we mentioned previously, triangle-based surface models assume that the topology (and 
often the rough shape) of the 3D model is known ahead of time. While it is possible to 
re-mesh a model as it is being deformed or fitted, a simpler solution is to dispense with an 
explicit triangle mesh altogether and to have triangle vertices behave as oriented points, or 
particles, or surface elements (surfels) (Szeliski and Tonnesen 1992). 

In order to endow the resulting particle system with internal smoothness constraints, pair- 
wise interaction potentials can be defined that approximate the equivalent elastic bending 
energies that would be obtained using local finite-element analysis. 8 Instead of defining the 
finite element neighborhood for each particle (vertex) ahead of time, a soft influence function 
is used to couple nearby particles. The resulting 3D model can change both topology and par- 
ticle density as it evolves and can therefore be used to interpolate partial 3D data with holes 
(Szeliski, Tonnesen, and Terzopoulos 1993b). Discontinuities in both the surface orientation 
and crease curves can also be modeled (Szeliski, Tonnesen, and Terzopoulos 1993a). 

To render the particle system as a continuous surface, local dynamic triangulation heuris- 
tics (Szeliski and Tonnesen 1992) or direct surface element splatting (Pfister, Zwicker, van 
Baar et al. 2000) can be used. Another alternative is to first convert the point cloud into an 
implicit signed distance or inside-outside function, using either minimum signed distances 
to the oriented points (Hoppe, DeRose, Duchamp et al. 1992) or by interpolating a charac- 
teristic (inside-outside) function using radial basis functions (Turk and O’Brien 2002; Dinh, 
Turk, and Slabaugh 2002). Even greater precision over the implicit function fitting, including 
the ability to handle irregular point densities, can be obtained by computing a moving least 

8 As mentioned before, an alternative is to use sub-linear interaction potentials, which encourage the preservation 
of surface creases (Diebel, Thrun, and Brünig 2006). 
Figure 12.13 Point-based surface modeling with moving least squares (MLS) (Pauly, Keiser, 
Kobbelt et al. 2003) © 2003 ACM: (a) a set of points (black dots) is turned into an implicit 
inside-outside function (black curve); (b) the signed distance to the nearest oriented point 
can serve as an approximation to the inside-outside distance; (c) a set of oriented points 
with variable sampling density representing a 3D surface (head model); (d) local estimate of 
sampling density, which is used in the moving least squares; (e) reconstructed continuous 3D 
surface. 


squares (MLS) estimate of the signed distance function (Alexa, Behr, Cohen-Or et al. 2003; 
Pauly, Keiser, Kobbelt et al. 2003), as shown in Figure 12.13. Further improvements can 
be obtained using local sphere fitting (Guennebaud and Gross 2007), faster and more accu- 
rate re-sampling (Guennebaud, Germann, and Gross 2008), and kernel regression to better 
tolerate outliers (Öztireli, Guennebaud, and Gross 2008). 
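
The simplest of these conversions, the signed distance to the nearest oriented point, can be sketched directly. This is a minimal illustration, assuming unit-length outward normals and a brute-force nearest-point search. It implements the tangent-plane distance to the single closest sample and none of the moving least squares refinements. The function name is hypothetical.

```python
import math


def signed_distance_to_oriented_points(x, points, normals):
    """Signed distance from x to the tangent plane of its nearest oriented point."""
    i = min(range(len(points)), key=lambda k: math.dist(x, points[k]))
    # Project the offset from the nearest sample onto that sample's normal.
    return sum(n * (a - p) for n, a, p in zip(normals[i], x, points[i]))


# Eight oriented points on the unit circle (in the z = 0 plane), normals pointing outward.
angles = [k * math.pi / 4.0 for k in range(8)]
points = [(math.cos(a), math.sin(a), 0.0) for a in angles]
normals = points  # on a unit circle the outward normal equals the position
inside = signed_distance_to_oriented_points((0.5, 0.0, 0.0), points, normals)
outside = signed_distance_to_oriented_points((2.0, 0.0, 0.0), points, normals)
```

With outward normals the function is negative inside and positive outside, and the reconstructed surface is its zero set. The piecewise-planar result is discontinuous where the nearest sample changes, which is part of the motivation for the smoother MLS estimates.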


12.5 Volumetric representations 

A third alternative for modeling 3D surfaces is to construct 3D volumetric inside-outside 
functions. We already saw examples of this in Section 11.6.1, where we looked at voxel coloring 
(Seitz and Dyer 1999), space carving (Kutulakos and Seitz 2000), and level set (Faugeras 
and Keriven 1998; Pons, Keriven, and Faugeras 2007) techniques for stereo matching, and 
Section 11.6.2, where we discussed using binary silhouette images to reconstruct volumes. 

In this section, we look at continuous implicit (inside-outside) functions to represent 3D 
shape. 


12.5.1 Implicit surfaces and level sets 

While polyhedral and voxel-based representations can represent three-dimensional shapes 
to an arbitrary precision, they lack some of the intrinsic smoothness properties available 
with continuous implicit surfaces, which use an indicator function (characteristic function) 
$F(x, y, z)$ to indicate which 3D points are inside ($F(x, y, z) < 0$) or outside ($F(x, y, z) > 0$) 
the object. 

An early example of using implicit functions to model 3D objects in computer vision is 
superquadrics, which are a generalization of quadric (e.g., ellipsoidal) parametric volumetric 
models, 

$$F(x, y, z) = \left( \left(\frac{x}{a_1}\right)^{2/\epsilon_2} + \left(\frac{y}{a_2}\right)^{2/\epsilon_2} \right)^{\epsilon_2/\epsilon_1} + \left(\frac{z}{a_3}\right)^{2/\epsilon_1} - 1 = 0 \tag{12.8}$$

(Pentland 1986; Solina and Bajcsy 1990; Waithe and Ferrie 1991; Leonardis, Jaklič, and 
Solina 1997). The values of $(a_1, a_2, a_3)$ control the extent of the model along each $(x, y, z)$ axis, 
while the values of $(\epsilon_1, \epsilon_2)$ control how “square” it is. To model a wider variety of shapes, 
superquadrics are usually combined with either rigid or non-rigid deformations (Terzopoulos 
and Metaxas 1991; Metaxas and Terzopoulos 2002). Superquadric models can either be fit to 
range data or used directly for stereo matching. 
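
The inside-outside test for a superquadric can be sketched as a direct evaluation of its implicit function. This is a minimal illustration, assuming the standard form F = ((x/a1)^(2/e2) + (y/a2)^(2/e2))^(e2/e1) + (z/a3)^(2/e1) - 1, with absolute values taken before the fractional powers so that all octants are handled. The function name and default parameters are hypothetical.

```python
def superquadric_F(x, y, z, a=(1.0, 1.0, 1.0), eps=(1.0, 1.0)):
    """Superquadric implicit function: negative inside, zero on the surface, positive outside."""
    a1, a2, a3 = a
    e1, e2 = eps
    xy = (abs(x / a1) ** (2.0 / e2) + abs(y / a2) ** (2.0 / e2)) ** (e2 / e1)
    return xy + abs(z / a3) ** (2.0 / e1) - 1.0


center = superquadric_F(0.0, 0.0, 0.0)                         # unit sphere, center
on_surface = superquadric_F(2.0, 0.0, 0.0, a=(2.0, 1.0, 1.0))  # tip of an ellipsoid
corner_sphere = superquadric_F(0.9, 0.9, 0.9)                  # outside the unit sphere
corner_box = superquadric_F(0.9, 0.9, 0.9, eps=(0.1, 0.1))     # inside a box-like shape
```

With both exponents equal to 1 the shape is an ellipsoid. As they approach 0 it becomes increasingly box-like, which is why the same corner point flips from outside to inside in the example.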

A different kind of implicit shape model can be constructed by defining a signed distance 
function over a regular three-dimensional grid, optionally using an octree spline to represent 
this function more coarsely away from its surface (zero-set) (Lavallee and Szeliski 1995; 
Szeliski and Lavallee 1996; Frisken, Perry, Rockwood et al. 2000; Ohtake, Belyaev, Alexa 
et al. 2003). We have already seen examples of signed distance functions being used to 
represent distance transforms (Section 3.3.3), level sets for 2D contour fitting and tracking 
(Section 5.1.4), volumetric stereo (Section 11.6.1), range data merging (Section 12.2.1), and 
point-based modeling (Section 12.4). The advantage of representing such functions directly 
on a grid is that it is quick and easy to look up distance function values for any $(x, y, z)$ 
location and also easy to extract the isosurface using the marching cubes algorithm (Lorensen 
and Cline 1987). The work of Ohtake, Belyaev, Alexa et al. (2003) is particularly notable 
since it allows for several distance functions to be used simultaneously and then combined 
locally to produce sharp features such as creases. 
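
The "quick and easy lookup" on a regular grid can be sketched with trilinear interpolation. This is a minimal illustration, assuming NumPy, a uniform cubic grid over [-1, 1]^3, and a sphere as the sampled shape. It uses a dense array and not an octree spline, and the function names are hypothetical.

```python
import numpy as np


def make_sphere_sdf(n=33, radius=0.5):
    """Sample the signed distance to a sphere on an n x n x n grid over [-1, 1]^3."""
    axis = np.linspace(-1.0, 1.0, n)
    X, Y, Z = np.meshgrid(axis, axis, axis, indexing="ij")
    return axis, np.sqrt(X ** 2 + Y ** 2 + Z ** 2) - radius


def lookup(axis, grid, p):
    """Trilinearly interpolate the gridded function at a point p inside the grid."""
    h = axis[1] - axis[0]
    g = (np.asarray(p, dtype=float) - axis[0]) / h  # continuous grid coordinates
    i0 = np.clip(np.floor(g).astype(int), 0, len(axis) - 2)
    f = g - i0  # fractional position within the cell
    value = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((f[0] if dx else 1.0 - f[0]) *
                     (f[1] if dy else 1.0 - f[1]) *
                     (f[2] if dz else 1.0 - f[2]))
                value += w * grid[i0[0] + dx, i0[1] + dy, i0[2] + dz]
    return value


axis, sdf = make_sphere_sdf()
at_center = lookup(axis, sdf, (0.0, 0.0, 0.0))
at_surface = lookup(axis, sdf, (0.5, 0.0, 0.0))
at_corner = lookup(axis, sdf, (1.0, 0.0, 0.0))
```

The same gridded array can be handed to an isosurface extractor such as marching cubes to recover the zero set as a mesh.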

Poisson surface reconstruction (Kazhdan, Bolitho, and Hoppe 2006) uses a closely related 
volumetric function, namely a smoothed 0/1 inside-outside (characteristic) function, which 
can be thought of as a clipped signed distance function. The gradients for this function are 
set to lie along oriented surface normals near known surface points and 0 elsewhere. The 
function itself is represented using a quadratic tensor-product B-spline over an octree, which 
provides a compact representation with larger cells away from the surface or in regions of 
lower point density, and also admits the efficient solution of the related Poisson equations 
(3.100-3.102), see Section 9.3.4 (Pérez, Gangnet, and Blake 2003). 

It is also possible to replace the quadratic penalties used in the Poisson equations with 
$L_1$ (total variation) constraints and still obtain a convex optimization problem, which can be 
solved using either continuous (Zach, Pock, and Bischof 2007b; Zach 2008) or discrete graph 
cut (Lempitsky and Boykov 2007) techniques. 
Signed distance functions also play an integral role in level-set evolution equations (Sections 
5.1.4 and 11.6.1), where the values of distance transforms on the mesh are updated as 
the surface evolves to fit multi-view stereo photoconsistency measures (Faugeras and Keriven 
1998). 

12.6 Model-based reconstruction 

When we know something ahead of time about the objects we are trying to model, we can 
construct more detailed and reliable 3D models using specialized techniques and representa- 
tions. For example, architecture is usually made up of large planar regions and other para- 
metric forms (such as surfaces of revolution), usually oriented perpendicular to gravity and 
to each other (Section 12.6.1). Heads and faces can be represented using low-dimensional, 
non-rigid shape models, since the variability in shape and appearance of human faces, while 
extremely large, is still bounded (Section 12.6.2). Human bodies or parts, such as hands, form 
highly articulated structures, which can be represented using kinematic chains of piecewise 
rigid skeletal elements linked by joints (Section 12.6.4). 

In this section, we highlight some of the main ideas, representations, and modeling algo- 
rithms used for these three cases. Additional details and references can be found in special- 
ized conferences and workshops devoted to these topics, e.g., the International Symposium on 
3D Data Processing, Visualization, and Transmission (3DPVT), the International Conference 
on 3D Digital Imaging and Modeling (3DIM), the International Conference on Automatic 
Face and Gesture Recognition (FG), the IEEE Workshop on Analysis and Modeling of Faces 
and Gestures, and the International Workshop on Tracking Humans for the Evaluation of their 
Motion in Image Sequences (THEMIS). 

12.6.1 Architecture 

Architectural modeling, especially from aerial photography, has been one of the longest studied
problems in both photogrammetry and computer vision (Walker and Herman 1988). Recently,
the development of reliable image-based modeling techniques, as well as the prevalence
of digital cameras and 3D computer games, has spurred renewed interest in this area.

The work by Debevec, Taylor, and Malik (1996) was one of the earliest hybrid geometry-
and image-based modeling and rendering systems. Their Façade system combines an interactive
image-guided geometric modeling tool with model-based (local plane plus parallax)
stereo matching and view-dependent texture mapping. During the interactive photogrammet- 
ric modeling phase, the user selects block elements and aligns their edges with visible edges 
in the input images (Figure 12.14a). The system then automatically computes the dimensions 
and locations of the blocks along with the camera positions using constrained optimization 

Figure 12.14 Interactive architectural modeling using the Façade system (Debevec, Taylor,
and Malik 1996) © 1996 ACM: (a) input image with user-drawn edges shown in green; 
(b) shaded 3D solid model; (c) geometric primitives overlaid onto the input image; (d) final 
view-dependent, texture-mapped 3D model. 


(Figure 12.14b-c). This approach is intrinsically more reliable than general feature-based 
structure from motion, because it exploits the strong geometry available in the block primi- 
tives. Related work by Becker and Bove (1995), Horry, Anjyo, and Arai (1997), and Crimin- 
isi, Reid, and Zisserman (2000) exploits similar information available from vanishing points. 
In the interactive, image-based modeling system of Sinha, Steedly, Szeliski et al. (2008),
vanishing point directions are used to guide the user drawing of polygons, which are then 
automatically fitted to sparse 3D points recovered using structure from motion. 

Once the rough geometry has been estimated, more detailed offset maps can be com- 
puted for each planar face using a local plane sweep, which Debevec, Taylor, and Malik 
(1996) call model-based stereo. Finally, during rendering, images from different viewpoints 
are warped and blended together as the camera moves around the scene, using a process (re- 
lated to light field and Lumigraph rendering, see Section 13.3) called view-dependent texture 
mapping (Figure 12.14d). 
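
As a rough illustration of the blending step in view-dependent texture mapping, the sketch below weights each source image by how closely its viewing direction agrees with that of the virtual camera. The function names and the `power` sharpness parameter are our own assumptions for illustration; they are not taken from the Façade system, which uses its own weighting scheme.

```python
import numpy as np

def view_dependent_weights(virtual_dir, source_dirs, power=8.0):
    """Blend weights that favor source views close to the virtual view.

    virtual_dir: (3,) direction from the surface point to the virtual camera.
    source_dirs: (N, 3) directions from the surface point to each source camera.
    power: hypothetical sharpness parameter (an assumption, not from the text).
    """
    v = np.asarray(virtual_dir, dtype=float)
    s = np.asarray(source_dirs, dtype=float)
    v = v / np.linalg.norm(v)
    s = s / np.linalg.norm(s, axis=1, keepdims=True)
    # Cosine of the angle between each source view and the virtual view;
    # views more than 90 degrees away get zero weight.
    cos_angle = np.clip(s @ v, 0.0, 1.0)
    w = cos_angle ** power
    total = w.sum()
    if total == 0.0:
        # No source view faces the virtual camera: fall back to uniform weights.
        return np.full(len(s), 1.0 / len(s))
    return w / total

def blend_colors(colors, weights):
    """Weighted average of per-view colors, colors has shape (N, 3)."""
    return np.asarray(weights) @ np.asarray(colors, dtype=float)
```

Raising `power` makes the result snap to the single closest view, while lowering it produces smoother but blurrier transitions as the camera moves.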

For interior modeling, instead of working with single pictures, it is more useful to work 
with panoramas, since you can see larger extents of walls and other structures. The 3D mod- 
eling system developed by Shum, Han, and Szeliski (1998) first constructs calibrated panora- 
mas from multiple images (Section 7.4) and then has the user draw vertical and horizontal 
lines in the image to demarcate the boundaries of planar regions. The lines are initially used 
to establish an absolute rotation for each panorama and are later used (along with the inferred 
vertices and planes) to optimize the 3D structure, which can be recovered up to scale from 
one or more images (Figure 12.15). 360° high dynamic range panoramas can also be used for 
outdoor modeling, since they provide highly reliable estimates of relative camera orientations 
as well as vanishing point directions (Antone and Teller 2002; Teller, Antone, Bodnar et al. 

Figure 12.15 Interactive 3D modeling from panoramas (Shum, Han, and Szeliski 1998)
© 1998 IEEE: (a) wide-angle view of a panorama with user-drawn vertical and horizontal 
(axis-aligned) lines; (b) single-view reconstruction of the corridors. 


2003). 

While earlier image-based modeling systems required some user authoring, Werner and 
Zisserman (2002) present a fully automated line-based reconstruction system. As described 
in Section 7.5.1, they first detect lines and vanishing points and use them to calibrate the 
camera; then they establish line correspondences using both appearance matching and tri- 
focal tensors, which enables them to reconstruct families of 3D line segments, as shown in 
Figure 12.16a. They then generate plane hypotheses, using both co-planar 3D lines and a 
plane sweep (Section 11.1.2) based on cross-correlation scores evaluated at interest points. 
Intersections of planes are used to determine the extent of each plane, i.e., an initial coarse ge- 
ometry, which is then refined with the addition of rectangular or wedge-shaped indentations 
and extrusions (Figure 12.16c). Note that when top-down maps of the buildings being mod- 
eled are available, these can be used to further constrain the 3D modeling process (Robertson 
and Cipolla 2002, 2009). The idea of using matched 3D lines for estimating vanishing point 
directions and dominant planes continues to be used in a number of recent fully automated 
image-based architectural modeling systems (Zebedin, Bauer, Karner et al. 2008; Mičušík
and Košecká 2009; Furukawa, Curless, Seitz et al. 2009b; Sinha, Steedly, and Szeliski 2009).
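
To make the role of vanishing points more concrete, here is a minimal sketch of estimating one vanishing point from a set of 2D line segments that are assumed to belong to the same family of parallel 3D lines. It uses a plain algebraic least-squares fit; the systems cited above use more robust estimators, and the function names are our own.

```python
import numpy as np

def line_through(p, q):
    """Homogeneous line through two image points given as (x, y)."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def vanishing_point(segments):
    """Least-squares vanishing point of a set of segments.

    segments: iterable of ((x0, y0), (x1, y1)) endpoint pairs.
    Returns a homogeneous 3-vector v minimizing sum_i (l_i . v)^2 with |v| = 1.
    The third coordinate may be near zero when the image lines are parallel.
    """
    lines = np.array([line_through(p, q) for p, q in segments])
    # Scale each line so that l . (x, y, 1) is a point-to-line distance.
    lines /= np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    # The minimizer is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(lines)
    return vt[-1]
```

In practice the segments would first be grouped, e.g., using RANSAC, since a single outlier line can bias this unweighted fit.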

Another common characteristic of architecture is the repeated use of primitives such as 
windows, doors, and colonnades. Architectural modeling systems can be designed to search 
for such repeated elements and to use them as part of the structure inference process (Dick, 
Torr, and Cipolla 2004; Mueller, Zeng, Wonka et al. 2007; Schindler, Krishnamurthy, Lublinerman
et al. 2008; Sinha, Steedly, Szeliski et al. 2008).

The combination of all these techniques now makes it possible to reconstruct the structure 
of large 3D scenes (Zhu and Kanade 2008). For example, the Urbanscan system of Polle- 
feys, Nister, Frahm et al. (2008) reconstructs texture-mapped 3D models of city streets from 
videos acquired with a GPS-equipped vehicle. To obtain real-time performance, they use 
both optimized on-line structure-from-motion algorithms, as well as GPU implementations 

Figure 12.16 Automated architectural reconstruction using 3D lines and planes (Werner
and Zisserman 2002) © 2002 Springer: (a) reconstructed 3D lines, color coded by their van- 
ishing directions; (b) wire-frame model superimposed onto an input image; (c) triangulated 
piecewise-planar model with windows; (d) final texture-mapped model. 


of plane-sweep stereo aligned to dominant planes and depth map fusion. Cornelis, Leibe, 
Cornelis et al. (2008) present a related system that also uses plane-sweep stereo (aligned to 
vertical building facades) combined with object recognition and segmentation for vehicles. 
Mičušík and Košecká (2009) build on these results using omni-directional images and superpixel-based
stereo matching along dominant plane orientations. Reconstruction directly from
active range scanning data combined with color imagery that has been compensated for ex- 
posure and lighting variations is also possible (Chen and Chen 2008; Stamos, Liu, Chen et 
al. 2008; Troccoli and Allen 2008). 


12.6.2 Heads and faces 

Another area in which specialized shape and appearance models are extremely helpful is in 
the modeling of heads and faces. Even though the appearance of people seems at first glance 
to be infinitely variable, the actual shape of a person’s head and face can be described rea- 
sonably well using a few dozen parameters (Pighin, Hecker, Lischinski et al. 1998; Guenter, 
Grimm, Wood et al. 1998; DeCarlo, Metaxas, and Stone 1998; Blanz and Vetter 1999; Shan, 
Liu, and Zhang 2001). 

Figure 12.17 shows an example of an image-based modeling system, where user-specified
keypoints in several images are used to fit a generic head model to a person’s face. As you 
can see in Figure 12.17c, after specifying just over 100 keypoints, the shape of the face has 
become quite adapted and recognizable. Extracting a texture map from the original images 
and then applying it to the head model results in an animatable model with striking visual 
fidelity (Figure 12.18a). 

A more powerful system can be built by applying principal component analysis (PCA) to 
a collection of 3D scanned faces, which is a topic we discuss in Section 12.6.3. As you can 
see in Figure 12.19, it is then possible to fit morphable 3D models to single images and to 

Figure 12.17 3D model fitting to a collection of images (Pighin, Hecker, Lischinski et
al. 1998) © 1998 ACM: (a) set of five input images along with user-selected keypoints; (b) 
the complete set of keypoints and curves; (c) three meshes — the original, adapted after 13 
keypoints, and after an additional 99 keypoints; (d) the partition of the image into separately 
animatable regions. 





Figure 12.18 Head and expression tracking and re-animation using deformable 3D models.
(a) Models fit directly to five input video streams (Pighin, Szeliski, and Salesin 2002) © 
2002 Springer: The bottom row shows the results of re-animating a synthetic texture-mapped 
3D model with pose and expression parameters fitted to the input images in the top row. (b) 
Models fit to frame-rate spacetime stereo surface models (Zhang, Snavely, Curless et al. 2004) 
© 2004 ACM: The top row shows the input images with synthetic green markers overlaid, 
while the bottom row shows the fitted 3D surface model. 

use such models for a variety of animation and visual effects (Blanz and Vetter 1999). It is
also possible to design stereo matching algorithms that optimize directly for the head model 
parameters (Shan, Liu, and Zhang 2001; Kang and Jones 2002) or to use the output of real- 
time stereo with active illumination (Zhang, Snavely, Curless et al. 2004) (Figures 12.7 and 
12.18b). 

As the sophistication of 3D facial capture systems evolves, so does the detail and realism 
in the reconstructed models. Newer systems can capture (in real-time) not only surface details 
such as wrinkles and creases, but also accurate models of skin reflection, translucency, and 
sub-surface scattering (Weyrich, Matusik, Pfister et al. 2006; Golovinskiy, Matusik, Pfister et al.
2006; Bickel, Botsch, Angst et al. 2007; Igarashi, Nishino, and Nayar 2007).

Once a 3D head model has been constructed, it can be used in a variety of applications, 
such as head tracking (Toyama 1998; Lepetit, Pilet, and Fua 2004; Matthews, Xiao, and Baker 
2007), as shown in Figures 4.29 and 14.24, and face transfer, i.e., replacing one person’s 
face with another in a video (Bregler, Covell, and Slaney 1997; Vlasic, Brand, Pfister et al.
2005). Additional applications include face beautification by warping face images toward a 
more attractive “standard” (Leyvand, Cohen-Or, Dror et al. 2008), face de-identification for 
privacy protection (Gross, Sweeney, De la Torre et al. 2008), and face swapping (Bitouk, 
Kumar, Dhillon et al. 2008). 

12.6.3 Application: Facial animation

Perhaps the most widely used application of 3D head modeling is facial animation. Once 
a parameterized 3D model of shape and appearance (surface texture) has been constructed, 
it can be used directly to track a person’s facial motions (Figure 12.18a) and to animate a 
different character with these same motions and expressions (Pighin, Szeliski, and Salesin 
2002). 

An improved version of such a system can be constructed by first applying principal com- 
ponent analysis (PCA) to the space of possible head shapes and facial appearances. Blanz 
and Vetter (1999) describe a system where they first capture a set of 200 colored range scans 
of faces (Figure 12.19a), which can be represented as a large collection of (X, Y, Z, R, G, B)
samples (vertices). 9 In order for 3D morphing to be meaningful, corresponding vertices in 
different people’s scans must first be put into correspondence (Pighin, Hecker, Lischinski et 
al. 1998). Once this is done, PCA can be applied to more naturally parameterize the 3D mor- 
phable model. The flexibility of this model can be increased by performing separate analyses 
in different subregions, such as the eyes, nose, and mouth, just as in modular eigenspaces 
(Moghaddam and Pentland 1997). 
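
The PCA step can be sketched in a few lines, assuming the scans have already been put into vertex-to-vertex correspondence and flattened into rows of (X, Y, Z, R, G, B) values. This is only an illustration of the general idea, not the implementation of Blanz and Vetter (1999); the function names are our own.

```python
import numpy as np

def build_morphable_model(scans, num_components):
    """Fit a linear (PCA) model to registered face scans.

    scans: (num_faces, num_vertices * 6) array; each row is one flattened scan
           whose vertices are already in correspondence (assumed).
    Returns the mean face, the principal directions (one per row), and the
    standard deviation of the data along each direction.
    """
    mean = scans.mean(axis=0)
    centered = scans - mean
    # The SVD of the centered data yields the principal directions without
    # forming the (very large) covariance matrix explicitly.
    _, sing, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:num_components]
    stddev = sing[:num_components] / np.sqrt(max(len(scans) - 1, 1))
    return mean, basis, stddev

def project(mean, basis, face):
    """Coefficients of a face in the model subspace."""
    return (face - mean) @ basis.T

def synthesize(mean, basis, coeffs):
    """Reconstruct a face from its subspace coefficients."""
    return mean + coeffs @ basis
```

Running the same analysis separately on subsets of columns (e.g., the vertices belonging to the eyes, nose, or mouth) gives the modular variant mentioned above.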

9 A cylindrical coordinate system provides a natural two-dimensional embedding for this collection, but such an 
embedding is not necessary to perform PCA. 

Figure 12.19 3D morphable face model (Blanz and Vetter 1999) © 1999 ACM: (a) original
3D face model with the addition of shape and texture variations in specific directions:
deviation from the mean (caricature), gender, expression, weight, and nose shape; (b) a 3D 
morphable model is fit to a single image, after which its weight or expression can be manip- 
ulated; (c) another example of a 3D reconstruction along with a different set of 3D manipula- 
tions such as lighting and pose change. 

After computing a subspace representation, different directions in this space can be associated
with different characteristics such as gender, facial expressions, or facial features
(Figure 12.19a). As in the work of Rowland and Perrett (1995), faces can be turned into 
caricatures by exaggerating their displacement from the mean image. 
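
Caricature generation reduces to scaling a face's offset from the mean. The following minimal sketch assumes faces are stored as flat vectors in correspondence with the mean; the `exaggeration` parameter name is our own.

```python
import numpy as np

def caricature(face, mean_face, exaggeration=1.5):
    """Exaggerate a face's displacement from the mean.

    exaggeration = 1 returns the original face, values above 1 produce a
    caricature, and values between 0 and 1 move the face toward the average.
    """
    face = np.asarray(face, dtype=float)
    mean_face = np.asarray(mean_face, dtype=float)
    return mean_face + exaggeration * (face - mean_face)
```

The same formula with the offset replaced by a learned attribute direction (e.g., weight or expression) gives the manipulations shown in Figure 12.19a.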

3D morphable models can be fitted to a single image using gradient descent on the error 
between the input image and the re-synthesized model image, after an initial manual place- 
ment of the model in an approximately correct pose, scale, and location (Figures 12.19b-c). 
The efficiency of this fitting process can be increased using inverse compositional image 
alignment (8.64–8.65), as described by Romdhani and Vetter (2003).
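
The structure of this analysis-by-synthesis loop can be illustrated with a deliberately simplified stand-in, in which the "renderer" is a linear model (mean plus a weighted sum of basis images) and pose, lighting, and perspective are ignored. A real morphable model fit must differentiate through a full rendering pipeline; the names and step size below are assumptions made for this sketch only.

```python
import numpy as np

def fit_coefficients(image, mean, basis, steps=500, step_size=0.5):
    """Gradient descent on E(a) = 0.5 * ||image - (mean + a @ basis)||^2.

    image, mean: flattened images of length D.
    basis: (K, D) array of basis images (rows assumed roughly orthonormal;
           otherwise step_size must be below 2 / largest eigenvalue of
           basis @ basis.T for the iteration to converge).
    """
    coeffs = np.zeros(basis.shape[0])
    for _ in range(steps):
        # Error between the input and the re-synthesized image.
        residual = image - (mean + coeffs @ basis)
        # dE/da = -basis @ residual
        grad = -(basis @ residual)
        coeffs -= step_size * grad
    return coeffs
```

Because this toy energy is quadratic, gradient descent converges to the unique least-squares solution; the real problem is non-convex, which is why a manual initialization close to the correct pose is needed.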

The resulting texture-mapped 3D model can then be modified to produce a variety of vi- 
sual effects, including changing a person’s weight or expression, or three-dimensional effects 
such as re-lighting or 3D video-based animation (Section 13.5.1). Such models can also be 
used for video compression, e.g., by only transmitting a small number of facial expression 
and pose parameters to drive a synthetic avatar (Eisert, Wiegand, and Girod 2000; Gao, Chen, 
Wang et al. 2003). 

3D facial animation is often matched to the performance of an actor, in what is known 
as performance-driven animation (Section 4.1.5) (Williams 1990). Traditional performance-driven
animation systems use marker-based motion capture (Ma, Jones, Chiang et al. 2008),
while some newer systems use video footage to control the animation (Buck, Finkelstein, 
Jacobs et al. 2000; Pighin, Szeliski, and Salesin 2002; Zhang, Snavely, Curless et al. 2004; 
Vlasic, Brand, Pfister et al. 2005). 

An example of the latter approach is the system developed for the film Benjamin Button, 
in which Digital Domain used the CONTOUR system from Mova 10 to capture actor Brad 
Pitt’s facial motions and expressions (Roble and Zafar 2009). CONTOUR uses a combina- 
tion of phosphorescent paint and multiple high-resolution video cameras to capture real-time 
3D range scans of the actor. These 3D models were then translated into Facial Action Coding
System (FACS) shape and expression parameters (Ekman and Friesen 1978) to drive a
different (older) synthetically animated computer-generated imagery (CGI) character. 

12.6.4 Whole body modeling and tracking 

The topics of tracking humans, modeling their shape and appearance, and recognizing their 
activities are some of the most actively studied areas of computer vision. Annual conferences 11
and special journal issues (Hilton, Fua, and Ronfard 2006) are devoted to this subject,
and two recent surveys (Forsyth, Arikan, Ikemoto et al. 2006; Moeslund, Hilton, and

10 http://www.mova.com. 

11 International Conference on Automatic Face and Gesture Recognition (FG), IEEE Workshop on Analysis and
Modeling of Faces and Gestures, and International Workshop on Tracking Humans for the Evaluation of their Motion 
in Image Sequences (THEMIS). 

Krüger 2006) each list over 400 papers devoted to these topics. 12 The HumanEva database
of articulated human motions 13 contains multi-view video sequences of human actions along 
with corresponding motion capture data, evaluation code, and a reference 3D tracker based on 
particle filtering. The companion paper by Sigal, Balan, and Black (2010) not only describes 
the database and evaluation but also has a nice survey of important work in this field. 

Given the breadth of this area, it is difficult to categorize all of this research, especially 
since different techniques usually build on each other. Moeslund, Hilton, and Krüger (2006)
divide their survey into initialization, tracking (which includes background modeling and 
segmentation), pose estimation, and action (activity) recognition. Forsyth, Arikan, Ikemoto et 
al. (2006) divide their survey into sections on tracking (background subtraction, deformable 
templates, flow, and probabilistic models), recovering 3D pose from 2D observations, and 
data association and body parts. They also include a section on motion synthesis, which is 
more widely studied in computer graphics (Arikan and Forsyth 2002; Kovar, Gleicher, and 
Pighin 2002; Lee, Chai, Reitsma et al. 2002; Li, Wang, and Shum 2002; Pullen and Bregler
2002), see Section 13.5.2. Another potential taxonomy for work in this field would be along 
the lines of whether 2D or 3D (or multi-view) images are used as input and whether 2D or 
3D kinematic models are used. 

In this section, we briefly review some of the more seminal and widely cited papers in the 
areas of background subtraction, initialization and detection, tracking with flow, 3D kinematic 
models, probabilistic models, adaptive shape modeling, and activity recognition. We refer the 
reader to the previously mentioned surveys for other topics and more details. 

Background subtraction. One of the first steps in many (but certainly not all) human track- 
ing systems is to model the background in order to extract the moving foreground objects 
(silhouettes) corresponding to people. Toyama, Krumm, Brumitt et al. (1999) review several 
difference matting and background maintenance (modeling) techniques and provide a good 
introduction to this topic. Stauffer and Grimson (1999) describe some techniques based on 
mixture models, while Sidenbladh and Black (2003) develop a more comprehensive treat- 
ment, which models not only the background image statistics but also the appearance of the 
foreground objects, e.g., their edge and motion (frame difference) statistics. 
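
As a simple illustration of background maintenance, the sketch below keeps a running single-Gaussian model (mean and variance) at every pixel and flags pixels that deviate by more than a few standard deviations. This is a deliberate simplification of the mixture models of Stauffer and Grimson (1999), which keep several Gaussians per pixel; the class name and all default parameter values are our own assumptions.

```python
import numpy as np

class RunningGaussianBackground:
    """Per-pixel running Gaussian background model for grayscale frames."""

    def __init__(self, first_frame, learning_rate=0.05, threshold=2.5, min_var=4.0):
        self.mean = first_frame.astype(float)
        self.var = np.full(first_frame.shape, 15.0 ** 2)  # generous initial variance
        self.lr = learning_rate
        self.k = threshold        # foreground if |diff| > k standard deviations
        self.min_var = min_var    # floor that keeps the test from becoming too strict

    def apply(self, frame):
        """Return a boolean foreground mask and update the background model."""
        frame = frame.astype(float)
        diff = frame - self.mean
        foreground = diff ** 2 > (self.k ** 2) * self.var
        # Update only pixels classified as background, so that foreground
        # objects are not absorbed into the model immediately.
        bg = ~foreground
        self.mean[bg] += self.lr * diff[bg]
        self.var[bg] += self.lr * (diff[bg] ** 2 - self.var[bg])
        np.maximum(self.var, self.min_var, out=self.var)
        return foreground
```

The resulting mask usually needs morphological cleanup and connected component analysis before it can serve as a silhouette.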

Once silhouettes have been extracted from one or more cameras, they can then be mod- 
eled using deformable templates or other contour models (Baumberg and Hogg 1996; Wren, 
Azarbayejani, Darrell et al. 1997). Tracking such silhouettes over time supports the analysis 
of multiple people moving around a scene, including building shape and appearance models 

12 Older surveys include those by Gavrila (1999) and Moeslund and Granum (2001). Some surveys on gesture 
recognition, which we do not cover in this book, include those by Pavlovic, Sharma, and Huang (1997) and Yang, 
Ahuja, and Tabb (2002). 

13 http://vision.cs.brown.edu/humaneva/. 

and detecting if they are carrying objects (Haritaoglu, Harwood, and Davis 2000; Mittal and
Davis 2003; Dimitrijevic, Lepetit, and Fua 2006). 

Initialization and detection. In order to track people in a fully automated manner, it is 
necessary to first detect (or re-acquire) their presence in individual video frames. This topic 
is closely related to pedestrian detection, which is often considered as a kind of object recognition
(Mori, Ren, Efros et al. 2004; Felzenszwalb and Huttenlocher 2005; Felzenszwalb,
McAllester, and Ramanan 2008), and is therefore treated in more depth in Section 14.1.2.
Additional techniques for initializing 3D trackers based on 2D images include those described 
by Howe, Leventon, and Freeman (2000), Rosales and Sclaroff (2000), Shakhnarovich, Viola, 
and Darrell (2003), Sminchisescu, Kanaujia, Li et al. (2005), Agarwal and Triggs (2006), Lee 
and Cohen (2006), Sigal and Black (2006), and Stenger, Thayananthan, Torr et al. (2006). 

Single-frame human detection and pose estimation algorithms can sometimes be used by 
themselves to perform tracking (Ramanan, Forsyth, and Zisserman 2005; Rogez, Rihan, Ra- 
malingam et al. 2008; Bourdev and Malik 2009), as described in Section 4.1.4. More often, 
however, they are combined with frame-to-frame tracking techniques to provide better relia- 
bility (Fossati, Dimitrijevic, Lepetit et al. 2007; Andriluka, Roth, and Schiele 2008; Ferrari, 
Marin-Jimenez, and Zisserman 2008). 

Tracking with flow. The tracking of people and their pose from frame to frame can be en- 
hanced by computing optic flow or matching the appearance of their limbs from one frame 
to another. For example, the cardboard people model of Ju, Black, and Yacoob (1996) mod- 
els the appearance of each leg portion (upper and lower) as a moving rectangle, and uses 
optic flow to estimate their location in each subsequent frame. Cham and Rehg (1999) and 
Sidenbladh, Black, and Fleet (2000) track limbs using optical flow and templates, along with 
techniques for dealing with multiple hypotheses and uncertainty. Bregler, Malik, and Pullen 
(2004) use a full 3D model of limb and body motion, as described below. It is also possible to 
match the estimated motion field itself to some prototypes in order to identify the particular 
phase of a running motion or to match two low-resolution video portions in order to perform 
video replacement (Efros, Berg, Mori et al. 2003). 

3D kinematic models. The effectiveness of human modeling and tracking can be greatly 
enhanced using a more accurate 3D model of a person’s shape and motion. Underlying such 
representations, which are ubiquitous in 3D computer animation in games and special effects, 
is a kinematic model or kinematic chain, which specifies the length of each limb in a skeleton 
as well as the 2D or 3D rotation angles between the limbs or segments (Figure 12.20a-b). 
Inferring the values of the joint angles from the locations of the visible surface points is 
called inverse kinematics (IK) and is widely studied in computer graphics. 
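
A planar kinematic chain makes these definitions concrete. The sketch below computes forward kinematics (joint positions from limb lengths and relative joint angles) and solves a small inverse kinematics problem for the chain's end point using the Jacobian-transpose method, which is gradient descent on the squared position error. It is a toy 2D example with hypothetical function names and step sizes, not one of the trackers cited in this section.

```python
import numpy as np

def forward_kinematics(lengths, angles):
    """Joint positions of a planar chain rooted at the origin.

    lengths: limb lengths; angles: rotation of each limb relative to its parent.
    Returns an (N + 1, 2) array; the last row is the end point of the chain.
    """
    positions = [np.zeros(2)]
    total = 0.0
    for length, angle in zip(lengths, angles):
        total += angle  # absolute orientation accumulates along the chain
        step = length * np.array([np.cos(total), np.sin(total)])
        positions.append(positions[-1] + step)
    return np.array(positions)

def inverse_kinematics(lengths, target, angles0, steps=2000, step_size=0.1):
    """Jacobian-transpose IK: move the end point of the chain toward target."""
    angles = np.array(angles0, dtype=float)
    target = np.asarray(target, dtype=float)
    for _ in range(steps):
        joints = forward_kinematics(lengths, angles)
        error = target - joints[-1]
        # Rotating joint i moves the tip perpendicular to (tip - joint_i).
        rel = joints[-1] - joints[:-1]
        jac = np.stack([-rel[:, 1], rel[:, 0]], axis=0)  # shape (2, N)
        angles += step_size * (jac.T @ error)
    return angles
```

In a vision setting the error term would come from image measurements (edges, silhouettes, or flow) rather than from a known 3D target, but the chain-rule structure through the Jacobian is the same.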

Figure 12.20 Tracking 3D human motion: (a) kinematic chain model for a human hand
(Rehg, Morris, and Kanade 2003) © 2003, reprinted by permission of SAGE; (b) tracking a 
kinematic chain blob model in a video sequence (Bregler, Malik, and Pullen 2004) © 2004 
Springer; (c-d) probabilistic loose-limbed collection of body parts (Sigal, Bhatia, Roth et al. 
2004) 


Figure 12.20a shows the kinematic model for a human hand used by Rehg, Morris, and 
Kanade (2003) to track hand motion in a video. As you can see, the attachment points between 
the fingers and the thumb have two degrees of freedom, while the finger joints themselves 
have only one. Using this kind of model can greatly enhance the ability of an edge-based 
tracker to cope with rapid motion, ambiguities in 3D pose, and partial occlusions. 

Kinematic chain models are even more widely used for whole body modeling and tracking 
(O’Rourke and Badler 1980; Hogg 1983; Rohr 1994). One popular approach is to associate 
an ellipsoid or superquadric with each rigid limb in the kinematic model, as shown in Fig- 
ure 12.20b. This model can then be fitted to each frame in one or more video streams either 
by matching silhouettes extracted from known backgrounds or by matching and tracking the 
locations of occluding edges (Gavrila and Davis 1996; Kakadiaris and Metaxas 2000; Bre- 
gler, Malik, and Pullen 2004; Kehl and Van Gool 2006). Note that some techniques use 2D 
models coupled to 2D measurements, some use 3D measurements (range data or multi-view 
video) with 3D models, and some use monocular video to infer and track 3D models directly. 

It is also possible to use temporal models to improve the tracking of periodic motions, 
such as walking, by analyzing the joint angles as functions of time (Polana and Nelson 1997; 
Seitz and Dyer 1997; Cutler and Davis 2000). The generality and applicability of such tech- 
niques can be improved by learning typical motion patterns using principal component anal- 
ysis (Sidenbladh, Black, and Fleet 2000; Urtasun, Fleet, and Fua 2006). 

Probabilistic models. Because tracking can be such a difficult task, sophisticated proba- 
bilistic inference techniques are often used to estimate the likely states of the person being 
tracked. One popular approach, called particle filtering (Isard and Blake 1998), was origi- 
nally developed for tracking the outlines of people and hands, as described in Section 5.1.2 

Figure 12.21 Estimating human shape and pose from a single image using a parametric 3D
model (Guan, Weiss, Balan et al. 2009) © 2009 IEEE. 


(Figures 5.6-5.8). It was subsequently applied to whole-body tracking (Deutscher, Blake,
and Reid 2000; Sidenbladh, Black, and Fleet 2000; Deutscher and Reid 2005) and continues 
to be used in modern trackers (Ong, Micilotta, Bowden et al. 2006). Alternative approaches 
to handling the uncertainty inherent in tracking include multiple hypothesis tracking (Cham 
and Rehg 1999) and inflated covariances (Sminchisescu and Triggs 2001). 

Figure 12.20c-d shows an example of a sophisticated spatio-temporal probabilistic graph- 
ical model called loose-limbed people, which models not only the geometric relationship be- 
tween various limbs, but also their likely temporal dynamics (Sigal, Bhatia, Roth et al. 2004). 
The conditional probabilities relating various limbs and time instances are learned from train- 
ing data, and particle filtering is used to perform the final pose inference. 
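
The predict, weight, and resample cycle of a particle filter can be shown on a one-dimensional toy problem. The sketch below tracks a scalar state with random-walk dynamics and Gaussian observation noise; a body tracker would replace the scalar with a full pose vector and the Gaussian likelihood with an image-based one. All names and default values are assumptions chosen for illustration.

```python
import numpy as np

def particle_filter(observations, num_particles=500, motion_std=0.5,
                    obs_std=1.0, seed=0):
    """Bootstrap particle filter for a 1D random-walk state.

    Returns the posterior mean estimate after each observation.
    """
    rng = np.random.default_rng(seed)
    particles = rng.normal(observations[0], obs_std, size=num_particles)
    estimates = []
    for z in observations:
        # Predict: diffuse every particle with the motion model.
        particles = particles + rng.normal(0.0, motion_std, size=num_particles)
        # Weight: Gaussian likelihood of the observation given each particle,
        # computed in the log domain for numerical stability.
        log_w = -0.5 * ((z - particles) / obs_std) ** 2
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        estimates.append(float(w @ particles))
        # Resample: draw particles in proportion to their weights.
        idx = rng.choice(num_particles, size=num_particles, p=w)
        particles = particles[idx]
    return np.array(estimates)
```

Because the particle set can represent several competing hypotheses at once, this scheme copes with the multi-modal posteriors that arise under occlusion and depth ambiguity, at the cost of needing many particles in high-dimensional pose spaces.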

Adaptive shape modeling. Another essential component of whole body modeling and 
tracking is the fitting of parameterized shape models to visual data. As we saw in Sec- 
tion 12.6.3 (Figure 12.19), the availability of large numbers of registered 3D range scans can 
be used to create morphable models of shape and appearance (Allen, Curless, and Popović
2003). Building on this work, Anguelov, Srinivasan, Koller et al. (2005) develop a sophisticated
system called SCAPE (Shape Completion and Animation for PEople), which first
acquires a large number of range scans of different people and of one person in different
poses, and then registers these scans using semi-automated marker placement. The registered 
datasets are used to model the variation in shape as a function of personal characteristics and 
skeletal pose, e.g., the bulging of muscles as certain joints are flexed (Figure 12.21, top row). 
The resulting system can then be used for shape completion, i.e., the recovery of a full 3D 
mesh model from a small number of captured markers, by finding the best model parameters 
in both shape and pose space that fit the measured data. 

Because it is constructed completely from scans of people in close-fitting clothing and 
uses a parametric shape model, the SCAPE system cannot cope with people wearing loose- 
fitting clothing. Balan and Black (2008) overcome this limitation by estimating the body 
shape that fits within the visual hull of the same person observed in multiple poses, while 
Vlasic, Baran, Matusik et al. (2008) adapt an initial surface mesh fitted with a parametric 
shape model to better match the visual hull. 

While the preceding body fitting and pose estimation systems use multiple views to esti- 
mate body shape, even more recent work by Guan, Weiss, Balan et al. (2009) can fit a human 
shape and pose model to a single image of a person on a natural background. Manual ini- 
tialization is used to estimate a rough pose (skeleton) and height model, and this is then used 
to segment the person’s outline using the GrabCut segmentation algorithm (Section 5.5).
The shape and pose estimate are then refined using a combination of silhouette edge cues 
and shading information (Figure 12.21). The resulting 3D model can be used to create novel 
animations. 

Activity recognition. The final widely studied topic in human modeling is motion, activity, 
and action recognition (Bobick 1997; Hu, Tan, Wang et al. 2004; Hilton, Fua, and Ronfard 
2006). Examples of actions that are commonly recognized include walking and running, 
jumping, dancing, picking up objects, sitting down and standing up, and waving. Recent 
representative papers on these topics have been written by Robertson and Reid (2006), Smin- 
chisescu, Kanaujia, and Metaxas (2006), Weinland, Ronfard, and Boyer (2006), Yilmaz and 
Shah (2006), and Gorelick, Blank, Shechtman et al. (2007). 


12.7 Recovering texture maps and albedos 

After a 3D model of an object or person has been acquired, the final step in modeling is 
usually to recover a texture map to describe the object’s surface appearance. This first requires 
establishing a parameterization for the (u, v) texture coordinates as a function of 3D surface
position. One simple way to do this is to associate a separate texture map with each triangle 
(or pair of triangles). More space-efficient techniques involve unwrapping the surface onto 
one or more maps, e.g., using a subdivision mesh (Section 12.3.2) (Eck, DeRose, Duchamp
et al. 1995) or a geometry image (Section 12.3.3) (Gu, Gortler, and Hoppe 2002). 

Once the (u, v) coordinates for each triangle have been fixed, the perspective projection
equations mapping from texture (u, v) to an image j’s pixel (u_j, v_j) coordinates can be
obtained by concatenating the affine (u, v) → (X, Y, Z) mapping with the perspective homography
(X, Y, Z) → (u_j, v_j) (Szeliski and Shum 1997). The color values for the (u, v)
texture map can then be re-sampled and stored, or the original image can itself be used as the
texture source using projective texture mapping (OpenGL-ARB 1997). 
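
The concatenation of the two mappings can be written out for a single triangle as follows. The sketch assumes that image j is described by a 3 × 4 camera projection matrix and that the triangle's three vertices and texture coordinates are known; the function name and argument layout are our own.

```python
import numpy as np

def texture_to_image_homography(tex_coords, vertices, camera):
    """3x3 homography from texture (u, v) to pixel coordinates in image j.

    tex_coords: (3, 2) texture coordinates of the triangle's vertices.
    vertices:   (3, 3) corresponding 3D vertex positions (X, Y, Z).
    camera:     (3, 4) projection matrix of image j (assumed known).
    The result H satisfies [w * u_j, w * v_j, w] = H @ [u, v, 1].
    """
    tex_coords = np.asarray(tex_coords, dtype=float)
    vertices = np.asarray(vertices, dtype=float)
    # Affine map A with vertices[i] = A @ [u_i, v_i, 1] for the three corners.
    T = np.column_stack([tex_coords, np.ones(3)])
    A = np.linalg.solve(T, vertices).T
    # Append a row so that A_h @ [u, v, 1] = [X, Y, Z, 1].
    A_h = np.vstack([A, [0.0, 0.0, 1.0]])
    return np.asarray(camera, dtype=float) @ A_h

def apply_homography(H, uv):
    """Map one texture coordinate to pixel coordinates."""
    p = H @ np.array([uv[0], uv[1], 1.0])
    return p[:2] / p[2]
```

Re-sampling a texture map then amounts to evaluating `apply_homography` at every texel inside the triangle and interpolating the source image at the resulting pixel location.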

The situation becomes more involved when more than one source image is available for 
appearance recovery, which is the usual case. One possibility is to use a view-dependent 
texture map (Section 13.1.1), in which a different source image (or combination of source 
images) is used for each polygonal face based on the angles between the virtual camera, the 
surface normals, and the source images (Debevec, Taylor, and Malik 1996; Pighin, Hecker, 
Lischinski et al. 1998). An alternative approach is to estimate a complete Surface Light Field 
for each surface point (Wood, Azuma, Aldinger et al. 2000), as described in Section 13.3.2. 

In some situations, e.g., when using models in traditional 3D games, it is preferable to 
merge all of the source images into a single coherent texture map during pre-processing. 
Ideally, each surface triangle should select the source image where it is seen most directly 
(perpendicular to its normal) and at the resolution best matching the texture map resolution. 14 
This can be posed as a graph cut optimization problem, where the smoothness term encour- 
ages adjacent triangles to use similar source images, followed by blending to compensate 
for exposure differences (Lempitsky and Ivanov 2007; Sinha, Steedly, Szeliski et al. 2008). 
Even better results can be obtained by explicitly modeling geometric and photometric 
misalignments between the source images (Shum and Szeliski 2000; Gal, Wexler, Ofek et al. 
2010). 
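
The per-triangle preference for the most direct view can be sketched as below. This is only the data term of such a formulation, under the assumption that each triangle is summarized by its centroid and normal and each source image by its camera center; it deliberately omits the resolution matching, the smoothness term, and the graph cut optimization described above, and the function name is hypothetical.

```python
import math


def best_source_view(centroid, normal, camera_centers):
    """Index of the camera that sees a triangle most directly, or None.

    The score is the cosine of the angle between the triangle normal and
    the direction from the triangle centroid to the camera center.
    """
    def cosine(center):
        to_cam = [c - p for c, p in zip(center, centroid)]
        norm = (math.sqrt(sum(v * v for v in to_cam)) *
                math.sqrt(sum(n * n for n in normal)))
        return sum(n * v for n, v in zip(normal, to_cam)) / norm

    scores = [cosine(c) for c in camera_centers]
    best = max(range(len(scores)), key=lambda i: scores[i])
    # A non-positive cosine means the triangle faces away from every camera.
    return best if scores[best] > 0.0 else None
```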

These kinds of approaches produce good results when the lighting stays fixed with respect 
to the object, i.e., when the camera moves around the object or space. When the lighting is 
strongly directional, however, and the object is being moved relative to this lighting, strong 
shading effects or specularities may be present, which will interfere with the reliable recov- 
ery of a texture (albedo) map. In this case, it is preferable to explicitly undo the shading 
effects (Section 12.1) by modeling the light source directions and estimating the surface re- 
flectance properties while recovering the texture map (Sato and Ikeuchi 1996; Sato, Wheeler, 
and Ikeuchi 1997; Yu and Malik 1998; Yu, Debevec, Malik et al. 1999). Figure 12.22 shows 
the results of one such approach, where the specularities are first removed while estimat- 
ing the matte reflectance component (albedo) and then later re-introduced by estimating the 
specular component $k_s$ in a Torrance-Sparrow reflection model (2.91). 

14 When surfaces are seen at oblique viewing angles, it may be necessary to blend different images together to 
obtain the best resolution (Wang, Kang, Szeliski et al. 2001). 


Figure 12.22 Estimating the diffuse albedo and reflectance parameters for a scanned 3D 
model (Sato, Wheeler, and Ikeuchi 1997) © 1997 ACM: (a) set of input images projected 
onto the model; (b) the complete diffuse reflection (albedo) model; (c) rendering from the 
reflectance model including the specular component. 

12.7.1 Estimating BRDFs 

A more ambitious approach to the problem of view-dependent appearance modeling is to 
estimate a general bidirectional reflectance distribution function (BRDF) for each point on an 
object’s surface. Dana, van Ginneken, Nayar et al. (1999), Jensen, Marschner, Levoy et al. 
(2001), and Lensch, Kautz, Goesele et al. (2003) present different techniques for estimating 
such functions, while Dorsey, Rushmeier, and Sillion (2007) and Weyrich, Lawrence, Lensch 
et al. (2008) present more recent surveys of the topics of BRDF modeling, recovery, and 
rendering. 

As we saw in Section 2.2.2 (2.81), the BRDF can be written as 


$f_r(\theta_i, \phi_i, \theta_r, \phi_r; \lambda)$, (12.9) 

where $(\theta_i, \phi_i)$ and $(\theta_r, \phi_r)$ are the angles the incident $\hat{v}_i$ and reflected $\hat{v}_r$ light ray 
directions make with the local surface coordinate frame $(\hat{d}_x, \hat{d}_y, \hat{n})$ shown in Figure 2.15. When 
modeling the appearance of an object, as opposed to the appearance of a patch of material, we 
need to estimate this function at every point $(x, y)$ on the object’s surface, which gives us the 
spatially varying BRDF, or SVBRDF (Weyrich, Lawrence, Lensch et al. 2008), 


$f_v(x, y, \theta_i, \phi_i, \theta_r, \phi_r; \lambda)$. (12.10) 

If sub-surface scattering effects are being modeled, such as the long-range transmission 
of light through materials such as alabaster, the eight-dimensional bidirectional scattering- 
surface reflectance-distribution function (BSSRDF) is used instead, 

$f_e(x_i, y_i, \theta_i, \phi_i, x_e, y_e, \theta_e, \phi_e; \lambda)$, (12.11) 


where the e subscript now represents the emitted rather than the reflected light directions. 

Figure 12.23 Image-based reconstruction of appearance and detailed geometry (Lensch, 
Kautz, Goesele et al. 2003) © 2003 ACM. (a) Appearance models (BRDFs) are re-estimated 
using divisive clustering. (b) In order to model detailed spatially varying appearance, each 
lumitexel is projected onto the basis formed by the clustered materials. 

Weyrich, Lawrence, Lensch et al. (2008) provide a nice survey of these and related topics, 
including basic photometry, BRDF models, traditional BRDF acquisition using gonioreflectometry 
(the precise measurement of visual angles and reflectances), multiplexed illumination 
(Schechner, Nayar, and Belhumeur 2009), skin modeling (Debevec, Hawkins, Tchou et al. 
2000; Weyrich, Matusik, Pfister et al. 2006), and image-based acquisition techniques, which 
simultaneously recover an object’s 3D shape and reflectometry from multiple photographs. 

A nice example of this latter approach is the system developed by Lensch, Kautz, Goesele 
et al. (2003), who estimate locally varying BRDFs and refine their shape models using local 
estimates of surface normals. To build up their models, they first associate a lumitexel, which 
contains a 3D position, a surface normal, and a set of sparse radiance samples, with each 
surface point. Next, they cluster such lumitexels into materials that share common properties, 
using a Lafortune reflectance model (Lafortune, Foo, Torrance et al. 1997) and a divisive 
clustering approach (Figure 12.23a). Finally, in order to model detailed spatially varying 
appearance, each lumitexel (surface point) is projected onto the basis of clustered appearance 
models (Figure 12.23b). 

While most of the techniques discussed in this section require large numbers of views 
to estimate surface properties, a challenging future direction will be to take these techniques 
out of the lab and into the real world, and to combine them with regular and Internet photo 
image-based modeling approaches. 

12.7.2 Application: 3D photography 

The techniques described in this chapter for building complete 3D models from multiple im- 
ages and then recovering their surface appearance have opened up a whole new range of 
applications that often go under the name 3D photography. Pollefeys and Van Gool (2002) 
provide a nice introduction to this field, including the processing steps of feature matching, 
structure from motion recovery, 15 dense depth map estimation, 3D model building, and texture 
map recovery. A complete Web-based system for automatically performing all of these 
tasks, called ARC3D, is described by Vergauwen and Van Gool (2006) and Moons, Van Gool, 
and Vergauwen (2010). The latter paper provides not only an in-depth survey of this whole 
field but also a detailed description of their complete end-to-end system. 

An alternative to such fully automated systems is to put the user in the loop in what is 
sometimes called interactive computer vision. van den Hengel, Dick, Thormählen et al. (2007) 
describe their VideoTrace system, which performs automated point tracking and 3D structure 
recovery from video and then lets the user draw triangles and surfaces on top of the resulting 
point cloud, as well as interactively adjusting the locations of model vertices. Sinha, Steedly, 
Szeliski et al. (2008) describe a related system that uses matched vanishing points in multiple 
images (Figure 4.45) to infer 3D line orientations and plane normals. These are then used to 
guide the user drawing axis-aligned planes, which are automatically fitted to the recovered 
3D point cloud. Fully automated variants on these ideas are described by Zebedin, Bauer, 
Karner et al. (2008), Furukawa, Curless, Seitz et al. (2009a), Furukawa, Curless, Seitz et al. 
(2009b), Mičušík and Košecká (2009), and Sinha, Steedly, and Szeliski (2009). 

As the sophistication and reliability of these techniques continue to improve, we can expect 
to see even more user-friendly applications for photorealistic 3D modeling from images 
(Exercise 12.8). 


12.8 Additional reading 

Shape from shading is one of the classic problems in computer vision (Horn 1975). Some 
representative papers in this area include those by Horn (1977), Ikeuchi and Horn (1981), 
Pentland (1984), Horn and Brooks (1986), Horn (1990), Szeliski (1991a), Mancini and Wolff 
(1992), Dupuis and Oliensis (1994), and Fua and Leclerc (1995). The collection of papers 
edited by Horn and Brooks (1989) is a great source of information on this topic, especially 
the chapter on variational approaches. The survey by Zhang, Tsai, Cryer et al. (1999) not 
only reviews more recent techniques but also provides some comparative results. 

Woodham (1981) wrote the seminal paper on photometric stereo. Shape from texture 
techniques include those by Witkin (1981), Ikeuchi (1981), Blostein and Ahuja (1987), Gårding 
(1992), Malik and Rosenholtz (1997), Liu, Collins, and Tsin (2004), Liu, Lin, and Hays 
(2004), Hays, Leordeanu, Efros et al. (2006), Lin, Hays, Wu et al. (2006), Lobay and Forsyth 
(2006), White and Forsyth (2006), White, Crane, and Forsyth (2007), and Park, Brockle- 
hurst, Collins et al. (2009). Good papers and books on depth from defocus have been written 
by Pentland (1987), Nayar and Nakagawa (1994), Nayar, Watanabe, and Noguchi (1996), 

15 These earlier steps are also discussed in Section 7.4.4. 
Watanabe and Nayar (1998), Chaudhuri and Rajagopalan (1999), and Favaro and Soatto 
(2006). Additional techniques for recovering shape from various kinds of illumination ef- 
fects, including inter-reflections (Nayar, Ikeuchi, and Kanade 1991), are discussed in the 
book on shape recovery edited by Wolff, Shafer, and Healey (1992b). 

Active rangefinding systems, which use laser or natural light illumination projected into 
the scene, have been described by Besl (1989), Rioux and Bird (1993), Kang, Webb, Zit- 
nick et al. (1995), Curless and Levoy (1995), Curless and Levoy (1996), Proesmans, Van 
Gool, and Defoort (1998), Bouguet and Perona (1999), Curless (1999), Hebert (2000), Id- 
dan and Yahav (2001), Goesele, Fuchs, and Seidel (2003), Scharstein and Szeliski (2003), 
Davis, Ramamoorthi, and Rusinkiewicz (2003), Zhang, Curless, and Seitz (2003), Zhang, 
Snavely, Curless et al. (2004), and Moons, Van Gool, and Vergauwen (2010). Individual 
range scans can be aligned using 3D correspondence and distance optimization techniques 
such as iterated closest points and its variants (Besl and McKay 1992; Zhang 1994; Szeliski 
and Lavallée 1996; Johnson and Kang 1997; Gold, Rangarajan, Lu et al. 1998; Johnson 
and Hebert 1999; Pulli 1999; David, DeMenthon, Duraiswami et al. 2004; Li and Hartley 
2007; Enqvist, Josephson, and Kahl 2009). Once they have been aligned, range scans can 
be merged using techniques that model the signed distance of surfaces to volumetric sam- 
ple points (Hoppe, DeRose, Duchamp et al. 1992; Curless and Levoy 1996; Hilton, Stoddart, 
Illingworth et al. 1996; Wheeler, Sato, and Ikeuchi 1998; Kazhdan, Bolitho, and Hoppe 2006; 
Lempitsky and Boykov 2007; Zach, Pock, and Bischof 2007b; Zach 2008). 

Once constructed, 3D surfaces can be modeled and manipulated using a variety of three- 
dimensional representations, which include triangle meshes (Eck, DeRose, Duchamp et al. 
1995; Hoppe 1996), splines (Farin 1992, 1996; Lee, Wolberg, and Shin 1996), subdivision 
surfaces (Stollnitz, DeRose, and Salesin 1996; Zorin, Schröder, and Sweldens 1996; Warren 
and Weimer 2001; Peters and Reif 2008), and geometry images (Gu, Gortler, and Hoppe 
2002). Alternatively, they can be represented as collections of point samples with local ori- 
entation estimates (Hoppe, DeRose, Duchamp et al. 1992; Szeliski and Tonnesen 1992; Turk 
and O’Brien 2002; Pfister, Zwicker, van Baar et al. 2000; Alexa, Behr, Cohen-Or et al. 2003; 
Pauly, Keiser, Kobbelt et al. 2003; Diebel, Thrun, and Brünig 2006; Guennebaud and Gross 
2007; Guennebaud, Germann, and Gross 2008; Öztireli, Guennebaud, and Gross 2008). They 
can also be modeled using implicit inside-outside characteristic or signed distance functions 
sampled on regular or irregular (octree) volumetric grids (Lavallée and Szeliski 1995; Szeliski 
and Lavallée 1996; Frisken, Perry, Rockwood et al. 2000; Dinh, Turk, and Slabaugh 2002; 
Kazhdan, Bolitho, and Hoppe 2006; Lempitsky and Boykov 2007; Zach, Pock, and Bischof 
2007b; Zach 2008). 

The literature on model-based 3D reconstruction is extensive. For modeling architecture 
and urban scenes, both interactive and fully automated systems have been developed. A 
special journal issue devoted to the reconstruction of large-scale 3D scenes (Zhu and Kanade 
2008) is a good source of references, and Robertson and Cipolla (2009) give a nice description 
of a complete system. Lots of additional references can be found in Section 12.6.1. 

Face and whole body modeling and tracking is a very active sub-field of computer vision, 
with its own conferences and workshops, e.g., the International Conference on Automatic 
Face and Gesture Recognition (FG), the IEEE Workshop on Analysis and Modeling of Faces 
and Gestures, and the International Workshop on Tracking Humans for the Evaluation of their 
Motion in Image Sequences (THEMIS). Recent survey articles on the topic of whole body 
modeling and tracking include those by Forsyth, Arikan, Ikemoto et al. (2006), Moeslund, 
Hilton, and Krüger (2006), and Sigal, Balan, and Black (2010). 


12.9 Exercises 


Ex 12.1: Shape from focus Grab a series of focused images with a digital SLR set to man- 
ual focus (or get one that allows for programmatic focus control) and recover the depth of an 
object. 

1. Take some calibration images, e.g., of a checkerboard, so you can compute a mapping 
between the amount of defocus and the focus setting. 

2. Try both a fronto-parallel planar target and one which is slanted so that it covers the 
working range of the sensor. Which one works better? 

3. Now put a real object in the scene and perform a similar focus sweep. 

4. For each pixel, compute the local sharpness and fit a parabolic curve over focus settings 
to find the most in-focus setting. 

5. Map these focus settings to depth and compare your result to ground truth. If you are 
using a known simple object, such as a sphere or cylinder (a ball or a soda can), it’s easy 
to measure its true shape. 

6. (Optional) See if you can recover the depth map from just two or three focus settings. 

7. (Optional) Use an LCD projector to project artificial texture onto the scene. Use a pair 
of cameras to compare the accuracy of your shape from focus and shape from stereo 
techniques. 

8. (Optional) Create an all-in-focus image using the technique of Agarwala, Dontcheva, 
Agrawala et al. (2004). 
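
One possible starting point for step 4 is the three-point parabolic fit sketched below. It assumes that the focus settings are uniformly spaced and that a per-pixel list of sharpness (focus measure) values has already been computed; the function name and the choice of focus measure are left open and are not prescribed by the exercise.

```python
def refine_focus_peak(sharpness):
    """Return the sub-sample index of the sharpest focus setting.

    sharpness: focus-measure values for one pixel, one value per
               (uniformly spaced) focus setting.
    """
    k = max(range(len(sharpness)), key=lambda i: sharpness[i])
    if k == 0 or k == len(sharpness) - 1:
        return float(k)  # peak on the boundary: no neighbors to fit
    left, mid, right = sharpness[k - 1], sharpness[k], sharpness[k + 1]
    denom = left - 2.0 * mid + right
    if denom == 0.0:
        return float(k)  # flat neighborhood: keep the integer peak
    # Vertex of the parabola through (-1, left), (0, mid), (1, right).
    return k + 0.5 * (left - right) / denom
```

The returned fractional index can then be converted to a focus setting, and from there to depth, using the calibration from step 1.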


Ex 12.2: Shadow striping Implement the handheld shadow striping system of Bouguet 
and Perona (1999). The basic steps include the following. 
1. Set up two background planes behind the object of interest and calculate their orientation 
relative to the viewer, e.g., with fiducial marks. 

2. Cast a moving shadow with a stick across the scene; record the video or capture the 
data with a webcam. 

3. Estimate each light plane equation from the projections of the cast shadow against the 
two backgrounds. 

4. Triangulate to the remaining points on each curve to get a 3D stripe and display the 
stripes using a 3D graphics engine. 

5. (Optional) Remove the requirement for a known second (vertical) plane and infer its 
location (or that of the light source) using the techniques described by Bouguet and 
Perona (1999). The techniques from Exercise 10.9 may also be helpful here. 
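
The geometric core of steps 3 and 4 can be sketched as follows, assuming that three non-collinear 3D points on the shadow edge (taken from the two background planes) are already known and that each pixel on the stripe has been converted into a viewing ray. This is a minimal illustration with hypothetical helper names, not the calibration procedure of Bouguet and Perona (1999).

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def plane_from_points(p, q, r):
    """Light plane (n, d), with n . x = d, through three 3D points."""
    n = cross(sub(q, p), sub(r, p))
    return n, dot(n, p)


def intersect_ray_plane(origin, direction, plane):
    """3D point where a viewing ray meets the light plane, or None."""
    n, d = plane
    denom = dot(n, direction)
    if abs(denom) < 1e-12:
        return None  # ray is parallel to the plane
    t = (d - dot(n, origin)) / denom
    return tuple(o + t * di for o, di in zip(origin, direction))
```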

Ex 12.3: Range data registration Register two or more 3D datasets using either iterated 
closest points (ICP) (Besl and McKay 1992; Zhang 1994; Gold, Rangarajan, Lu et al. 1998) 
or octree signed distance fields (Szeliski and Lavallée 1996) (Section 12.2.1). 

Apply your technique to narrow-baseline stereo pairs, e.g., obtained by moving a cam- 
era around an object, using structure from motion to recover the camera poses, and using a 
standard stereo matching algorithm. 

Ex 12.4: Range data merging Merge the datasets that you registered in the previous exer- 
cise using signed distance fields (Curless and Levoy 1996; Hilton, Stoddart, Illingworth et al. 
1996). You can optionally use an octree to represent and compress this field if you already 
implemented it in the previous registration step. 

Extract a meshed surface model from the signed distance field using marching cubes and 
display the resulting model. 
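
The merging step can be prototyped with a running weighted average of signed distances at each voxel, in the spirit of Curless and Levoy (1996). The sketch below assumes that every scan has already been converted into a signed distance and a confidence weight per voxel, stored as flat lists over the same voxel grid, with a weight of zero marking an unobserved voxel; the function name and this data layout are illustrative choices.

```python
def fuse_signed_distances(scans):
    """Fuse per-scan signed distances into one field.

    scans: list of (distances, weights) pairs, each a flat list over the
           same voxels; weight 0 marks a voxel not observed by that scan.
    Returns (fused_distances, accumulated_weights).
    """
    n = len(scans[0][0])
    fused = [0.0] * n
    total = [0.0] * n
    for distances, weights in scans:
        for i in range(n):
            w = weights[i]
            if w > 0.0:
                # Incremental weighted average of the signed distance.
                fused[i] = (total[i] * fused[i] + w * distances[i]) / (total[i] + w)
                total[i] += w
    return fused, total
```

The surface is then the zero crossing of the fused field, restricted to voxels whose accumulated weight is positive.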

Ex 12.5: Surface simplification Use progressive meshes (Hoppe 1996) or some other tech- 
nique from Section 12.3.2 to create a hierarchical simplification of your surface model. 

Ex 12.6: Architectural modeler Build a 3D interior or exterior model of some architec- 
tural structure, such as your house, from a series of handheld wide-angle photographs. 

1. Extract lines and vanishing points (Exercises 4.11-4.15) to estimate the dominant di- 
rections in each image. 


2. Use structure from motion to recover all of the camera poses and match up the vanish- 
ing points. 




Computer Vision: Algorithms and Applications (September 3, 2010 draft) 


3. Let the user sketch the locations of the walls by drawing lines corresponding to wall 
bottoms, tops, and horizontal extents onto the images (Sinha, Steedly, Szeliski et al. 
2008) — see also Exercise 6.9. Do something similar for openings (doors and windows) 
and simple furniture (tables and countertops). 

4. Convert the resulting polygonal meshes into a 3D model (e.g., VRML) and optionally 
texture-map these surfaces from the images. 

Ex 12.7: Body tracker Download the video sequences from the HumanEva Web site. 16 
Either implement a human motion tracker from scratch or extend the code on that Web site 
(Sigal, Balan, and Black 2010) in some interesting way. 

Ex 12.8: 3D photography Combine all of your previously developed techniques to pro- 
duce a system that takes a series of photographs or a video and constructs a photorealistic 
texture-mapped 3D model. 


16 http://vision.cs.brown.edu/humaneva/. 

Figure 13.1 Image-based and video-based rendering: (a) a 3D view of a Photo Tourism re- 
construction (Snavely, Seitz, and Szeliski 2006) © 2006 ACM; (b) a slice through a 4D light 
field (Gortler, Grzeszczuk, Szeliski et al. 1996) © 1996 ACM; (c) sprites with depth (Shade, 
Gortler, He et al. 1998) © 1998 ACM; (d) surface light field (Wood, Azuma, Aldinger et al. 
2000) © 2000 ACM; (e) environment matte in front of a novel background (Zongker, Werner, 
Curless et al. 1999) © 1999 ACM; (f) real-time video environment matte (Chuang, Zongker, 
Hindorff et al. 2000) © 2000 ACM; (g) Video Rewrite used to re-animate old video (Bregler, 
Covell, and Slaney 1997) © 1997 ACM; (h) video texture of a candle flame (Schodl, Szeliski, 
Salesin et al. 2000) © 2000 ACM; (i) video view interpolation (Zitnick, Kang, Uyttendaele 
et al. 2004) © 2004 ACM. 
Over the last two decades, image-based rendering has emerged as one of the most exciting 
applications of computer vision (Kang, Li, Tong et al. 2006; Shum, Chan, and Kang 2007). 
In image-based rendering, 3D reconstruction techniques from computer vision are combined 
with computer graphics rendering techniques that use multiple views of a scene to create inter- 
active photo-realistic experiences, such as the Photo Tourism system shown in Figure 13.1a. 
Commercial versions of such systems include immersive street-level navigation in on-line 
mapping systems 1 and the creation of 3D Photosynths 2 from large collections of casually 
acquired photographs. 

In this chapter, we explore a variety of image-based rendering techniques, such as those 
illustrated in Figure 13.1. We begin with view interpolation (Section 13.1), which creates a 
seamless transition between a pair of reference images using one or more pre-computed depth 
maps. Closely related to this idea are view-dependent texture maps (Section 13.1.1), which 
blend multiple texture maps on a 3D model’s surface. The representations used for both the 
color imagery and the 3D geometry in view interpolation include a number of clever variants 
such as layered depth images (Section 13.2) and sprites with depth (Section 13.2.1). 

We continue our exploration of image-based rendering with the light field and Lumigraph 
four-dimensional representations of a scene’s appearance (Section 13.3), which can be used 
to render the scene from any arbitrary viewpoint. Variants on these representations include 
the unstructured Lumigraph (Section 13.3.1), surface light fields (Section 13.3.2), concentric 
mosaics (Section 13.3.3), and environment mattes (Section 13.4). 

The last part of this chapter explores the topic of video-based rendering, which uses one 
or more videos in order to create novel video-based experiences (Section 13.5). The topics 
we cover include video-based facial animation (Section 13.5.1), as well as video textures 
(Section 13.5.2), in which short video clips can be seamlessly looped to create dynamic real- 
time video-based renderings of a scene. We close with a discussion of 3D videos created from 
multiple video streams (Section 13.5.4), as well as video-based walkthroughs of environments 
(Section 13.5.5), which have found widespread application in immersive outdoor mapping 
and driving direction systems. 


13.1 View interpolation 

While the term image-based rendering first appeared in the papers by Chen (1995) and 
McMillan and Bishop (1995), the work on view interpolation by Chen and Williams (1993) 
is considered the seminal paper in the field. In view interpolation, pairs of rendered color 
images are combined with their pre-computed depth maps to generate interpolated views that 

1 http://maps.bing.com and http://maps.google.com. 

2 http://photosynth.net. 
Figure 13.2 View interpolation (Chen and Williams 1993) © 1993 ACM: (a) holes from 
one source image (shown in blue); (b) holes after combining two widely spaced images; (c) 
holes after combining two closely spaced images; (d) after interpolation (hole filling). 


mimic what a virtual camera would see in between the two reference views. 

View interpolation combines two ideas that were previously used in computer vision and 
computer graphics. The first is the idea of pairing a recovered depth map with the refer- 
ence image used in its computation and then using the resulting texture-mapped 3D model 
to generate novel views (Figure 11.1). The second is the idea of morphing (Section 3.6.3) 
(Figure 3.53), where correspondences between pairs of images are used to warp each refer- 
ence image to an in-between location while simultaneously cross-dissolving between the two 
warped images. 

Figure 13.2 illustrates this process in more detail. First, both source images are warped 
to the novel view, using both the knowledge of the reference and virtual 3D camera pose 
along with each image’s depth map (2.68-2.70). In the paper by Chen and Williams (1993), 
a forward warping algorithm (Algorithm 3.1 and Figure 3.46) is used. The depth maps are 
represented as quadtrees for both space and rendering time efficiency (Samet 1989). 

During the forward warping process, multiple pixels (which occlude one another) may 
land on the same destination pixel. To resolve this conflict, either a z-buffer depth value can 
be associated with each destination pixel or the images can be warped in back-to-front order, 
which can be computed based on the knowledge of epipolar geometry (Chen and Williams 
1993; Laveau and Faugeras 1994; McMillan and Bishop 1995). 
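
The z-buffer variant of this conflict resolution can be sketched as below. To keep the example self-contained, it assumes a purely horizontal camera translation, so that each pixel moves by `shift / depth` pixels along its row (with `shift` standing in for the product of focal length and camera displacement); a real implementation would instead use the full 3D camera poses. The function name is hypothetical.

```python
def forward_warp(color, depth, shift):
    """Forward warp one image to a novel view, resolving overlaps by depth.

    color, depth: 2D lists (rows of pixels) of equal size.
    shift:        assumed horizontal parallax scale; a pixel at depth z
                  moves by shift / z pixels along its row.
    Returns (warped, zbuf); holes in `warped` are marked with None.
    """
    height, width = len(color), len(color[0])
    warped = [[None] * width for _ in range(height)]
    zbuf = [[float("inf")] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            z = depth[y][x]
            xd = int(round(x + shift / z))
            # Keep the sample only if it is nearer than what is stored.
            if 0 <= xd < width and z < zbuf[y][xd]:
                zbuf[y][xd] = z
                warped[y][xd] = color[y][x]
    return warped, zbuf
```

In a one-row example with colors `a` to `e`, depths `[4, 4, 2, 4, 4]`, and `shift = 4`, the nearer pixel `c` wins the destination it shares with `d`, and a hole opens up where `c` used to hide the background.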

Once the two reference images have been warped to the novel view (Figure 13.2a-b), they 
can be merged to create a coherent composite (Figure 13.2c). Whenever one of the images 
has a hole (illustrated as a cyan pixel), the other image is used as the final value. When both 
images have pixels to contribute, these can be blended as in usual morphing, i.e., according 
to the relative distances between the virtual and source cameras. Note that if the two images 
have very different exposures, which can happen when performing view interpolation on real 
images, the hole-filled regions and the blended regions will have different exposures, leading 
to subtle artifacts. 
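
A minimal sketch of this merging rule is given below, assuming that both warped images are stored as 2D lists of scalar intensities with `None` marking holes and that `t` in [0, 1] is the fractional position of the virtual camera between the first and second source cameras. These conventions and the function name are assumptions made for the example.

```python
def merge_warped(img0, img1, t):
    """Merge two images that were warped to the same novel view.

    t: fractional position of the virtual camera between source camera 0
       (t = 0) and source camera 1 (t = 1).
    """
    merged = []
    for row0, row1 in zip(img0, img1):
        row = []
        for a, b in zip(row0, row1):
            if a is None and b is None:
                row.append(None)   # hole in both images: fill later
            elif a is None:
                row.append(b)      # only image 1 contributes
            elif b is None:
                row.append(a)      # only image 0 contributes
            else:
                row.append((1.0 - t) * a + t * b)  # cross-dissolve
        merged.append(row)
    return merged
```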

The final step in view interpolation (Figure 13.2d) is to fill any remaining holes or cracks 
due to the forward warping process or lack of source data (scene visibility). This can be done 
by copying pixels from the more distant (background) pixels adjacent to the hole. (Otherwise, 
foreground objects are subject to a “fattening effect”.) 
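
One simple way to realize this rule is to fill each run of hole pixels along a row from whichever bordering pixel is farther from the camera. The sketch below assumes that holes are marked with `None` and that a depth buffer for the warped image is available; the row-wise search is a simplification chosen for brevity, and the function name is hypothetical.

```python
def fill_holes_rowwise(color, zbuf):
    """Fill holes (None) in each row from the farther adjacent pixel."""
    filled = [row[:] for row in color]
    for y, row in enumerate(color):
        width = len(row)
        x = 0
        while x < width:
            if row[x] is not None:
                x += 1
                continue
            start = x
            while x < width and row[x] is None:
                x += 1
            # Valid pixels bordering the hole run [start, x).
            candidates = [i for i in (start - 1, x) if 0 <= i < width]
            if not candidates:
                continue  # the whole row is a hole
            src = max(candidates, key=lambda i: zbuf[y][i])
            for i in range(start, x):
                filled[y][i] = row[src]
    return filled
```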

The above process works well for rigid scenes, although its visual quality (lack of aliasing) 
can be improved using a two-pass, forward-backward algorithm (Section 13.2.1) (Shade, 
Gortler, He et al. 1998) or full 3D rendering (Zitnick, Kang, Uyttendaele et al. 2004). In the 
case where the two reference images are views of a non-rigid scene, e.g., a person smiling 
in one image and frowning in the other, view morphing, which combines ideas from view 
interpolation with regular morphing, can be used (Seitz and Dyer 1996). 

While the original view interpolation paper describes how to generate novel views based 
on similar pre-computed (linear perspective) images, the plenoptic modeling paper of McMil- 
lan and Bishop (1995) argues that cylindrical images should be used to store the pre-computed 
rendering or real-world images. Chen (1995) also proposes using environment maps (cylindrical, 
cubic, or spherical) as source images for view interpolation. 


13.1.1 View-dependent texture maps 

View-dependent texture maps (Debevec, Taylor, and Malik 1996) are closely related to view 
interpolation. Instead of associating a separate depth map with each input image, a single 3D 
model is created for the scene, but different images are used as texture map sources depending 
on the virtual camera’s current position (Figure 13.3a). 3 

In more detail, given a new virtual camera position, the similarity of this camera’s view of 
each polygon (or pixel) is compared to that of potential source images. The images are then 
blended using a weighting that is inversely proportional to the angles $\alpha_i$ between the virtual 
view and the source views (Figure 13.3a). Even though the geometric model can be fairly 
coarse (Figure 13.3b), blending between different views gives a strong sense of more detailed 
geometry because of the parallax (visual motion) between corresponding pixels. While the 
original paper performs the weighted blend computation separately at each pixel or coarsened 
polygon face, follow-on work by Debevec, Yu, and Borshukov (1998) presents a more effi- 
cient implementation based on precomputing contributions for various portions of viewing 
space and then using projective texture mapping (OpenGL-ARB 1997). 
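
The inverse-angle weighting can be sketched as follows. The viewing directions are assumed to be given as vectors from the surface point toward each camera, and the small `eps` term, which avoids a division by zero when the virtual view coincides with a source view, is an addition made for this example rather than part of the original formulation.

```python
import math


def angle_between(a, b):
    """Angle in radians between two direction vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return math.acos(max(-1.0, min(1.0, dot / (na * nb))))


def view_blend_weights(virtual_dir, source_dirs, eps=1e-6):
    """Normalized blend weights, inversely proportional to the angles."""
    inv = [1.0 / (angle_between(virtual_dir, d) + eps) for d in source_dirs]
    total = sum(inv)
    return [w / total for w in inv]
```

For source views at 45° and 90° from the virtual view, this gives weights of roughly 2/3 and 1/3.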

The idea of view-dependent texture mapping has been used in a large number of sub- 
sequent image-based rendering systems, including facial modeling and animation (Pighin, 

3 The term image-based modeling, which is now commonly used to describe the creation of texture-mapped 3D 
models from multiple images, appears to have first been used by Debevec, Taylor, and Malik (1996), who also used 
the term photogrammetric modeling to describe the same process. 
Figure 13.3 View-dependent texture mapping (Debevec, Taylor, and Malik 1996) © 1996 
ACM. (a) The weighting given to each input view depends on the relative angles between the 
novel (virtual) view and the original views; (b) simplified 3D model geometry; (c) with view- 
dependent texture mapping, the geometry appears to have more detail (recessed windows). 


Hecker, Lischinski et al. 1998) and 3D scanning and visualization (Pulli, Abi-Rached, Duchamp 
et al. 1998). Closely related to view-dependent texture mapping is the idea of blending be- 
tween light rays in 4D space, which forms the basis of the Lumigraph and unstructured Lu- 
migraph systems (Section 13.3) (Gortler, Grzeszczuk, Szeliski et al. 1996; Buehler, Bosse, 
McMillan et al. 2001). 

In order to provide even more realism in their Façade system, Debevec, Taylor, and Malik 
(1996) also include a model-based stereo component, which optionally computes an offset 
(parallax) map for each coarse planar facet of their 3D model. They call the resulting analysis 
and rendering system a hybrid geometry- and image-based approach, since it uses traditional 
3D geometric modeling to create the global 3D model, but then uses local depth offsets, along 
with view interpolation, to add visual realism. 

13.1.2 Application: Photo Tourism 

While view interpolation was originally developed to accelerate the rendering of 3D scenes 
on low-powered processors and systems without graphics acceleration, it turns out that it 
can be applied directly to large collections of casually acquired photographs. The Photo 
Tourism system developed by Snavely, Seitz, and Szeliski (2006) uses structure from motion 
to compute the 3D locations and poses of all the cameras taking the images, along with a 
sparse 3D point-cloud model of the scene (Section 7.4.4, Figure 7.11). 

To perform an image-based exploration of the resulting sea of images (Aliaga, Funkhouser, 
Yanovsky et al. 2003), Photo Tourism first associates a 3D proxy with each image. While a 
triangulated mesh obtained from the point cloud can sometimes form a suitable proxy, e.g., 
for outdoor terrain models, a simple dominant plane fit to the 3D points visible in each image 
Figure 13.4 Photo Tourism (Snavely, Seitz, and Szeliski 2006) © 2006 ACM: (a) a 3D 
overview of the scene, with translucent washes and lines painted onto the planar impostors; 
(b) once the user has selected a region of interest, a set of related thumbnails is displayed 
along the bottom; (c) planar proxy selection for optimal stabilization (Snavely, Garg, Seitz et 
al. 2008) © 2008 ACM. 


often performs better, because it does not contain any erroneous segments or connections that 
pop out as artifacts. As automated 3D modeling techniques continue to improve, however, 
the pendulum may swing back to more detailed 3D geometry (Goesele, Snavely, Curless et 
al. 2007; Sinha, Steedly, and Szeliski 2009). 

The resulting image-based navigation system lets users move from photo to photo, ei- 
ther by selecting cameras from a top-down view of the scene (Figure 13.4a) or by selecting 
regions of interest in an image, navigating to nearby views, or selecting related thumbnails 
(Figure 13.4b). To create a background for the 3D scene, e.g., when being viewed from 
above, non-photorealistic techniques (Section 10.5.2), such as translucent color washes or 
highlighted 3D line segments, can be used (Figure 13.4a). The system can also be used to 
annotate regions of images and to automatically propagate such annotations to other pho- 
tographs. 

The 3D planar proxies used in Photo Tourism and the related Photosynth system from 
Microsoft result in non-photorealistic transitions reminiscent of visual effects such as “page 
flips”. Selecting a stable 3D axis for all the planes can reduce the amount of swimming and 
enhance the perception of 3D (Figure 13.4c) (Snavely, Garg, Seitz et al. 2008). It is also 
possible to automatically detect objects in the scene that are seen from multiple views and 
create “orbits” of viewpoints around such objects. Furthermore, nearby images in both 3D 
position and viewing direction can be linked to create “virtual paths”, which can then be used 
to navigate between arbitrary pairs of images, such as those you might take yourself while 
walking around a popular tourist site (Snavely, Garg, Seitz et al. 2008). 

The spatial matching of image features and regions performed by Photo Tourism can 
also be used to infer more information from large image collections. For example, Simon, 
Snavely, and Seitz (2007) show how the match graph between images of popular tourist sites 
can be used to find the most iconic (commonly photographed) objects in the collection, along 
with their related tags. In follow-on work, Simon and Seitz (2008) show how such tags can 
be propagated to sub-regions of each image, using an analysis of which 3D points appear 
in the central portions of photographs. Extensions of these techniques to all of the world’s 
images, including the use of GPS tags where available, have been investigated as well (Li, 
Wu, Zach et al. 2008; Quack, Leibe, and Van Gool 2008; Crandall, Backstrom, Huttenlocher 
et al. 2009; Li, Crandall, and Huttenlocher 2009; Zheng, Zhao, Song et al. 2009).

13.2 Layered depth images 

Traditional view interpolation techniques associate a single depth map with each source or 
reference image. Unfortunately, when such a depth map is warped to a novel view, holes and 
cracks inevitably appear behind the foreground objects. One way to alleviate this problem is 
to keep several depth and color values (depth pixels) at every pixel in a reference image (or
at least at pixels near foreground-background transitions) (Figure 13.5). The resulting data
structure, which is called a layered depth image (LDI), can be used to render new views using 
a back-to-front forward warping (splatting) algorithm (Shade, Gortler, He et al. 1998). 
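
As a rough illustration of why keeping several depth pixels helps, the following Python sketch renders one scanline of an LDI into a shifted view. The function name `render_ldi_scanline`, the disparity model (a shift divided by depth), and the sort-based painter's ordering are simplifying assumptions made for this example; they are not the ordering algorithm of Shade, Gortler, He et al. (1998).

```python
def render_ldi_scanline(ldi, shift):
    """Forward-warp one LDI scanline into a new view.

    ldi   -- list indexed by x; each entry is a list of (depth, color) pairs
    shift -- hypothetical camera translation scaled by focal length, so that
             a sample at depth d moves by shift / d pixels
    Returns a list of colors, with None marking holes.
    """
    samples = [(depth, x, color)
               for x, layers in enumerate(ldi)
               for depth, color in layers]
    # Back-to-front: draw far samples first so nearer ones overwrite them.
    samples.sort(key=lambda sample: -sample[0])
    out = [None] * len(ldi)
    for depth, x, color in samples:
        x_new = x + int(round(shift / depth))
        if 0 <= x_new < len(out):
            out[x_new] = color
    return out


# A far background everywhere, plus one near foreground sample at x = 2.
layered = [[(10.0, "bg")] for _ in range(6)]
layered[2].append((1.0, "fg"))
# The same scene as a single depth map keeps only the nearest sample per pixel.
single = [[min(layers)] for layers in layered]

print(render_ldi_scanline(layered, 2.0))  # background fills in behind "fg"
print(render_ldi_scanline(single, 2.0))   # a hole (None) opens at x = 2
```

With the layered input, the background sample stored behind the foreground fills the pixel that the foreground vacates; with a single depth map, the same pixel becomes a hole.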

13.2.1 Impostors, sprites, and layers 

An alternative to keeping lists of color-depth values at each pixel, as is done in the LDI, is 
to organize objects into different layers or sprites. The term sprite originates in the computer 
game industry, where it is used to designate flat animated characters in games such as Pac- 
Man or Mario Bros. When put into a 3D setting, such objects are often called impostors, 
because they use a piece of flat, alpha-matted geometry to represent simplified versions of 3D 
objects that are far away from the camera (Shade, Lischinski, Salesin et al. 1996; Lengyel and 
Snyder 1997; Torborg and Kajiya 1996). In computer vision, such representations are usually 
called layers (Wang and Adelson 1994; Baker, Szeliski, and Anandan 1998; Torr, Szeliski, 
and Anandan 1999; Birchfield, Natarajan, and Tomasi 2007). Section 8.5.2 discusses the 
topics of transparent layers and reflections, which occur on specular and transparent surfaces 
such as glass. 

While flat layers can often serve as an adequate representation of geometry and appear- 
ance for far-away objects, better geometric fidelity can be achieved by also modeling the 
per-pixel offsets relative to a base plane, as shown in Figures 13.5 and 13.6a-b. Such repre- 
sentations are called plane plus parallax in the computer vision literature (Kumar, Anandan, 
and Hanna 1994; Sawhney 1994; Szeliski and Coughlan 1997; Baker, Szeliski, and Anandan 
1998), as discussed in Section 8.5 (Figure 8.16). In addition to fully automated stereo tech-
niques, it is also possible to paint in depth layers (Kang 1998; Oh, Chen, Dorsey et al. 2001;
Shum, Sun, Yamazaki et al. 2004) or to infer their 3D structure from monocular image cues
(Section 14.4.4) (Hoiem, Efros, and Hebert 2005b; Saxena, Sun, and Ng 2009).

Figure 13.5 A variety of image-based rendering primitives, which can be used depending
on the distance between the camera and the object of interest (Shade, Gortler, He et al. 1998)
© 1998 ACM. Closer objects may require more detailed polygonal representations, while
mid-level objects can use a layered depth image (LDI), and far-away objects can use sprites
(potentially with depth) and environment maps.

Figure 13.6 Sprites with depth (Shade, Gortler, He et al. 1998) © 1998 ACM: (a) alpha-
matted color sprite; (b) corresponding relative depth or parallax; (c) rendering without relative
depth; (d) rendering with depth (note the curved object boundaries).

How can we render a sprite with depth from a novel viewpoint? One possibility, as with 
a regular depth map, is to just forward warp each pixel to its new location, which can cause 
aliasing and cracks. A better way, which we already mentioned in Section 3.6.2, is to first 
warp the depth (or (u, v) displacement) map to the novel view, fill in the cracks, and then use
higher-quality inverse warping to resample the color image (Shade, Gortler, He et al. 1998).
Figure 13.6d shows the results of applying such a two-pass rendering algorithm. From this
still image, you can appreciate that the foreground sprites look more rounded; however, to 
fully appreciate the improvement in realism, you would have to look at the actual animated 
sequence. 
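
The two-pass idea can be sketched on a single scanline. In the NumPy example below, the function name `two_pass_warp`, the convention that a larger positive displacement means a closer surface, and the use of linear interpolation to fill cracks are all assumptions chosen for brevity; a real implementation would work in 2D and use a better reconstruction filter.

```python
import numpy as np


def two_pass_warp(color, disp):
    """Warp a 1D color signal using a per-pixel horizontal displacement.

    Pass 1 forward-warps the displacement map and fills cracks;
    pass 2 inverse-warps the colors using the warped displacements.
    """
    color = np.asarray(color, dtype=float)
    disp = np.asarray(disp, dtype=float)
    n = len(color)
    idx = np.arange(n)

    # Pass 1: forward warp the displacement; the larger (nearer) value wins.
    warped = np.full(n, np.nan)
    for x in range(n):
        x_new = int(round(x + disp[x]))
        if 0 <= x_new < n and (np.isnan(warped[x_new]) or disp[x] > warped[x_new]):
            warped[x_new] = disp[x]

    # Fill cracks and borders by interpolating between valid neighbors.
    valid = ~np.isnan(warped)
    warped = np.interp(idx, idx[valid], warped[valid])

    # Pass 2: inverse warp, i.e. look up each output pixel's source location.
    source = np.clip(idx - warped, 0, n - 1)
    return np.interp(source, idx, color)


demo = two_pass_warp(10.0 * np.arange(8), np.full(8, 2.0))
print(demo)
```

Because the colors are resampled with an inverse warp, every output pixel receives a value, which is what avoids the cracks of a pure forward warp.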

Sprites with depth can also be rendered using conventional graphics hardware, as de-
scribed by Zitnick, Kang, Uyttendaele et al. (2004). Rogmans, Fu, Bekaert et al. (2009)
describe GPU implementations of both real-time stereo matching and real-time forward and 
inverse rendering algorithms. 


13.3 Light fields and Lumigraphs 

While image-based rendering approaches can synthesize scene renderings from novel view- 
points, they raise the following more general question: 

Is it possible to capture and render the appearance of a scene from all possible
viewpoints and, if so, what is the complexity of the resulting structure? 

Let us assume that we are looking at a static scene, i.e., one where the objects and illu-
minants are fixed, and only the observer is moving around. Under these conditions, we can 
describe each image by the location and orientation of the virtual camera (6 dof) as well as 
its intrinsics (e.g., its focal length). However, if we capture a two-dimensional spherical im- 
age around each possible camera location, we can re-render any view from this information. 4 
Thus, taking the cross-product of the three-dimensional space of camera positions with the 
2D space of spherical images, we obtain the 5D plenoptic function of Adelson and Bergen 
(1991), which forms the basis of the image-based rendering system of McMillan and Bishop 
(1995). 

Notice, however, that when there is no light dispersion in the scene, i.e., no smoke or fog, 
all the coincident rays along a portion of free space (between solid or refractive objects) have 
the same color value. Under these conditions, we can reduce the 5D plenoptic function to
the 4D light field of all possible rays (Gortler, Grzeszczuk, Szeliski et al. 1996; Levoy and
Hanrahan 1996; Levoy 2006). 5

4 Since we are counting dimensions, we ignore for now any sampling or resolution issues.

Figure 13.7 The Lumigraph (Gortler, Grzeszczuk, Szeliski et al. 1996) © 1996 ACM: (a) a
ray is represented by its 4D two-plane parameters (s, t) and (u, v); (b) a slice through the 3D
light field subset (u, v, s).

To make the parameterization of this 4D function simpler, let us put two planes in the 
3D scene roughly bounding the area of interest, as shown in Figure 13.7a. Any light ray 
terminating at a camera that lives in front of the st plane (assuming that this space is empty) 
passes through the two planes at (s, t) and (u, v) and can be described by its 4D coordinate
(s, t, u, v). This diagram (and parameterization) can be interpreted as describing a family of
cameras living on the st plane with their image planes being the uv plane. The uv plane 
can be placed at infinity, which corresponds to all the virtual cameras looking in the same 
direction. 
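
A minimal sketch of this parameterization follows. It assumes, purely for illustration, that the st plane is z = 0 and the uv plane is z = 1 in scene coordinates; the function name `ray_to_stuv` and the plane depths are not from the original papers.

```python
import numpy as np


def ray_to_stuv(origin, direction, z_st=0.0, z_uv=1.0):
    """Return the two-plane coordinates (s, t, u, v) of a 3D ray.

    The ray is origin + lam * direction; the st and uv planes are assumed
    to be the planes z = z_st and z = z_uv.
    """
    origin = np.asarray(origin, dtype=float)
    direction = np.asarray(direction, dtype=float)
    if direction[2] == 0.0:
        raise ValueError("ray is parallel to the two planes")
    lam_st = (z_st - origin[2]) / direction[2]
    lam_uv = (z_uv - origin[2]) / direction[2]
    s, t = (origin + lam_st * direction)[:2]
    u, v = (origin + lam_uv * direction)[:2]
    return float(s), float(t), float(u), float(v)


print(ray_to_stuv([0.0, 0.0, -1.0], [1.0, 0.0, 1.0]))
```

Rays parallel to the planes have no finite two-plane coordinates, which is one reason several light slabs are needed to surround an object.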

In practice, if the planes are of finite extent, the finite light slab L(s, t, u, v) can be used to 
generate any synthetic view that a camera would see through a (finite) viewport in the st plane 
with a view frustum that wholly intersects the far uv plane. To enable the camera to move 
all the way around an object, the 3D space surrounding the object can be split into multiple 
domains, each with its own light slab parameterization. Conversely, if the camera is moving 
inside a bounded volume of free space looking outward, multiple cube faces surrounding the 
camera can be used as (s, t) planes.

5 Levoy and Hanrahan (1996) borrowed the term light field from a paper by Gershun (1939). Another name for 
this representation is the photic field (Moon and Spencer 1981).

Figure 13.8 Depth compensation in the Lumigraph (Gortler, Grzeszczuk, Szeliski et al.
1996) © 1996 ACM. To resample the (s, u) dashed light ray, the u parameter corresponding
to each discrete $s_i$ camera location is modified according to the out-of-plane depth z to yield
new coordinates $u'$ and $u''$; in (u, s) ray space, the original sample (△) is resampled from
the $(s_i, u')$ and $(s_{i+1}, u'')$ samples, which are themselves linear blends of their adjacent (○)
samples.


Thinking about 4D spaces is difficult, so let us drop our visualization by one dimension. 
If we fix the row value t and constrain our camera to move along the s axis while looking 
at the uv plane, we can stack all of the stabilized images the camera sees to get the (u, v, s)
epipolar volume, which we discussed in Section 11.6. A “horizontal” cross-section through 
this volume is the well-known epipolar plane image (Bolles, Baker, and Marimont 1987), 
which is the us slice shown in Figure 13.7b. 

As you can see in this slice, each color pixel moves along a linear track whose slope 
is related to its depth (parallax) from the uv plane. (Pixels exactly on the uv plane appear 
“vertical”, i.e., they do not move as the camera moves along s.) Furthermore, pixel tracks 
occlude one another as their corresponding 3D surface elements occlude. Translucent pixels, 
however, composite over background pixels (Section 3.1.3, (3.8)) rather than occluding them. 
Thus, we can think of adjacent pixels sharing a similar planar geometry as EPI strips or EPI 
tubes (Criminisi, Kang, Swaminathan et al. 2005). 
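
The linear tracks are easy to verify numerically. The sketch below assumes a 2D (flatland) setup in which the st plane lies at depth 0, the uv plane at depth D, and a scene point lies at lateral position X and depth D + z; these conventions and the name `epi_track` are chosen for the example only.

```python
import numpy as np


def epi_track(X, z, cam_positions, D=1.0):
    """u coordinate at which cameras at positions s see the point (X, D + z).

    The ray from (s, 0) to (X, D + z) crosses the uv plane (depth D) at
    u = s + (X - s) * D / (D + z), which is linear in s with slope z / (D + z).
    """
    s = np.asarray(cam_positions, dtype=float)
    return s + (X - s) * D / (D + z)


s = np.linspace(-1.0, 1.0, 5)
on_plane = epi_track(X=0.3, z=0.0, cam_positions=s)   # constant: a vertical track
behind = epi_track(X=0.3, z=1.0, cam_positions=s)     # slope 1.0 / 2.0
print(on_plane)
print(np.diff(behind) / np.diff(s))
```

Under these assumptions, points on the uv plane do not move at all, while the slope approaches 1 as the point recedes to infinity.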

The equations mapping between pixels (x, y) in a virtual camera and the corresponding
(s, t, u, v) coordinates are relatively straightforward to derive and are sketched out in Ex-
ercise 13.7. It is also possible to show that the set of pixels corresponding to a regular ortho-
graphic or perspective camera, i.e., one that has a linear projective relationship between 3D
points and (x, y) pixels (2.63), lies along a two-dimensional hyperplane in the (s, t, u, v) light
field (Exercise 13.7).

While a light field can be used to render a complex 3D scene from novel viewpoints, a
much better rendering (with less ghosting) can be obtained if something is known about its 
3D geometry. The Lumigraph system of Gortler, Grzeszczuk, Szeliski et al. (1996) extends 
the basic light field rendering approach by taking into account the 3D location of surface 
points corresponding to each 3D ray. 

Consider the ray (s, u) corresponding to the dashed line in Figure 13.8, which intersects 
the object’s surface at a distance z from the uv plane. When we look up the pixel’s color in 
camera $s_i$ (assuming that the light field is discretely sampled on a regular 4D (s, t, u, v) grid),
the actual pixel coordinate is $u'$, instead of the original u value specified by the (s, u) ray.
Similarly, for camera $s_{i+1}$ (where $s_i < s < s_{i+1}$), pixel address $u''$ is used. Thus, instead of
using quadri-linear interpolation of the nearest sampled (s, t, u, v) values around a given ray
to determine its color, the (u, v) values are modified for each discrete $(s_i, t_i)$ camera.

Figure 13.8 also shows the same reasoning in ray space. Here, the original continuous- 
valued (s,u) ray is represented by a triangle and the nearby sampled discrete values are 
shown as circles. Instead of just blending the four nearest samples, as would be indicated 
by the vertical and horizontal dashed lines, the modified $(s_i, u')$ and $(s_{i+1}, u'')$ values are
sampled instead and their values are then blended. 
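
The depth correction can be worked through in flatland. The sketch below assumes the st plane at depth 0, the uv plane at depth D, and a surface at depth D + z; under those assumptions, similar triangles give u' = u + (s_i - s) z / (D + z). The light field is represented as a Python function that is continuous in u and sampled only in s, which is a simplification; all names are hypothetical.

```python
import math

D = 1.0  # assumed distance between the st and uv planes


def surface_x(s, u, z):
    """Lateral position where the ray (s, u) meets a surface at depth D + z."""
    return s + (u - s) * (D + z) / D


def corrected_u(s, u, s_i, z):
    """u coordinate at which camera s_i sees the surface point hit by ray (s, u)."""
    return u + (s_i - s) * z / (D + z)


def lumigraph_lookup(L, s, u, s_i, s_j, z):
    """Blend the two neighboring cameras using depth-corrected u values."""
    w = (s - s_i) / (s_j - s_i)
    return (1 - w) * L(s_i, corrected_u(s, u, s_i, z)) + w * L(s_j, corrected_u(s, u, s_j, z))


def naive_lookup(L, s, u, s_i, s_j):
    """Blend the two neighboring cameras at the unmodified u (no geometry)."""
    w = (s - s_i) / (s_j - s_i)
    return (1 - w) * L(s_i, u) + w * L(s_j, u)


# Synthetic diffuse surface with texture sin(5 X) at out-of-plane depth z = 0.5.
z = 0.5
L = lambda s, u: math.sin(5.0 * surface_x(s, u, z))
truth = L(0.3, 0.7)
print(truth, lumigraph_lookup(L, 0.3, 0.7, 0.0, 1.0, z), naive_lookup(L, 0.3, 0.7, 0.0, 1.0))
```

For this diffuse test surface, the depth-corrected blend reproduces the true ray color, while the uncorrected blend mixes colors from two different surface points, which is the source of ghosting.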

The resulting rendering system produces images of much better quality than a proxy-free 
light field and is the method of choice whenever 3D geometry can be inferred. In subsequent 
work, Isaksen, McMillan, and Gortler (2000) show how a planar proxy for the scene, which 
is a simpler 3D model, can be used to simplify the resampling equations. They also describe 
how to create synthetic aperture photos, which mimic what might be seen by a wide-aperture 
lens, by blending more nearby samples (Levoy and Hanrahan 1996). A similar approach 
can be used to re-focus images taken with a plenoptic (microlens array) camera (Ng, Levoy, 
Brédif et al. 2005; Ng 2005) or a light field microscope (Levoy, Ng, Adams et al. 2006). It
can also be used to see through obstacles, using extremely large synthetic apertures focused 
on a background that can blur out foreground objects and make them appear translucent 
(Wilburn, Joshi, Vaish et al. 2005; Vaish, Szeliski, Zitnick et al. 2006). 
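
Synthetic aperture refocusing is often described as "shift and add". The following sketch assumes a 1D row of cameras at known positions, integer pixel shifts, and a hypothetical `disparity` parameter (pixels of shift per unit of camera displacement) that selects the focal plane; `np.roll` wraps around at the borders, which a real implementation would avoid.

```python
import numpy as np


def refocus(images, cam_positions, disparity):
    """Average the images after shifting each one in proportion to its camera position."""
    acc = np.zeros_like(np.asarray(images[0], dtype=float))
    for img, s in zip(images, cam_positions):
        shift = int(round(disparity * s))
        acc += np.roll(np.asarray(img, dtype=float), -shift, axis=-1)
    return acc / len(images)


# Three cameras see a bright point that moves 2 pixels per unit of camera motion.
cams = [-1, 0, 1]
images = []
for s in cams:
    img = np.zeros(11)
    img[5 + 2 * s] = 1.0
    images.append(img)

in_focus = refocus(images, cams, disparity=2)   # the point lines up at pixel 5
out_of_focus = refocus(images, cams, disparity=0)
print(in_focus.max(), out_of_focus.max())
```

Scene points whose parallax matches the chosen disparity add up coherently, while everything else is spread out, which is how a large synthetic aperture blurs away foreground occluders.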

Now that we understand how to render new images from a light field, how do we go about 
capturing such data sets? One answer is to move a calibrated camera with a motion control rig 
or gantry. 6 Another approach is to take handheld photographs and to determine the pose and 
intrinsic calibration of each image using either a calibrated stage or structure from motion. In 
this case, the images need to be rebinned into a regular 4D (s, t, u, v) space before they can
be used for rendering (Gortler, Grzeszczuk, Szeliski et al. 1996). Alternatively, the original
images can be used directly, with a process called the unstructured Lumigraph, which we
describe below.

6 See http://lightfield.stanford.edu/acq.html for a description of some of the gantries and camera arrays built at
the Stanford Computer Graphics Laboratory. This Web site also provides a number of light field data sets that are a
great source of research and project material.

Because of the large number of images involved, light fields and Lumigraphs can be quite 
voluminous to store and transmit. Fortunately, as you can tell from Figure 13.7b, there is 
a tremendous amount of redundancy (coherence) in a light field, which can be made even 
more explicit by first computing a 3D model, as in the Lumigraph. A number of techniques 
have been developed to compress and progressively transmit such representations (Gortler, 
Grzeszczuk, Szeliski et al. 1996; Levoy and Hanrahan 1996; Rademacher and Bishop 1998; 
Magnor and Girod 2000; Wood, Azuma, Aldinger et al. 2000; Shum, Kang, and Chan 2003; 
Magnor, Ramanathan, and Girod 2003; Shum, Chan, and Kang 2007). 


13.3.1 Unstructured Lumigraph 

When the images in a Lumigraph are acquired in an unstructured (irregular) manner, it can be 
counterproductive to resample the resulting light rays into a regularly binned (s, t, u, v) data
structure. This is both because resampling always introduces a certain amount of aliasing and 
because the resulting gridded light field can be populated very sparsely or irregularly. 

The alternative is to render directly from the acquired images, by finding for each light ray 
in a virtual camera the closest pixels in the original images. The unstructured Lumigraph ren- 
dering (ULR) system of Buehler, Bosse, McMillan et al. (2001) describes how to select such 
pixels by combining a number of fidelity criteria, including epipole consistency (distance of 
rays to a source camera’s center), angular deviation (similar incidence direction on the sur- 
face), resolution (similar sampling density along the surface), continuity (to nearby pixels), 
and consistency (along the ray). These criteria can all be combined to determine a weighting 
function between each virtual camera’s pixel and a number of candidate input cameras from 
which it can draw colors. To make the algorithm more efficient, the computations are per- 
formed by discretizing the virtual camera’s image plane using a regular grid overlaid with the 
polyhedral object mesh model and the input camera centers of projection and interpolating 
the weighting functions between vertices. 
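
To give a flavor of such a weighting function, the sketch below uses only the angular deviation criterion. The adaptive threshold (the angle of the (k+1)-th best camera), the linear falloff, and the function name `angular_blend_weights` are illustrative assumptions; the full ULR system combines several penalties, so this should not be read as the published algorithm.

```python
import numpy as np


def angular_blend_weights(surface_point, virtual_center, camera_centers, k=3):
    """Blend weights for input cameras based on angular deviation alone.

    At most k cameras receive nonzero weight, and weights fall smoothly to
    zero as a camera's viewing angle approaches that of the (k+1)-th best.
    """
    p = np.asarray(surface_point, dtype=float)
    cams = np.asarray(camera_centers, dtype=float)

    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    desired = unit(np.asarray(virtual_center, dtype=float) - p)
    angles = np.arccos(np.clip(unit(cams - p) @ desired, -1.0, 1.0))
    order = np.argsort(angles)
    thresh = angles[order[k]] if len(angles) > k else 2.0 * angles.max()
    thresh = max(thresh, 1e-12)
    w = np.clip(1.0 - angles / thresh, 0.0, None)
    if w.sum() == 0.0:  # degenerate tie: share the weight among the best cameras
        w = (angles <= angles.min() + 1e-12).astype(float)
    return w / w.sum()


cameras = [[-1, 0, 0], [0, 0, 0], [1, 0, 0], [5, 0, 0]]
weights = angular_blend_weights([0, 0, 5], [0, 0, 0], cameras, k=3)
print(weights)
```

In this example the virtual camera coincides with the second input camera, which therefore receives the largest weight, while the most oblique camera receives none.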

The unstructured Lumigraph generalizes previous work in both image-based rendering 
and light field rendering. When the input cameras are gridded, the ULR behaves the same way 
as regular Lumigraph rendering. When fewer cameras are available but the geometry is accu- 
rate, the algorithm behaves similarly to view-dependent texture mapping (Section 13.1.1). 


13.3.2 Surface light fields 

Of course, using a two-plane parameterization for a light field is not the only possible choice. 
(It is the one usually presented first since the projection equations and visualizations are the 
easiest to draw and understand.) As we mentioned on the topic of light field compression,
if we know the 3D shape of the object or scene whose light field is being modeled, we can
effectively compress the field because nearby rays emanating from nearby surface elements
have similar color values.

Figure 13.9 Surface light fields (Wood, Azuma, Aldinger et al. 2000) © 2000 ACM: (a)
example of a highly specular object with strong inter-reflections; (b) the surface light field
stores the light emanating from each surface point in all visible directions as a “Lumisphere”.

In fact, if the object is totally diffuse, ignoring occlusions, which can be handled using 
3D graphics algorithms or z-buffering, all rays passing through a given surface point will 
have the same color value. Hence, the light field “collapses” to the usual 2D texture-map 
defined over an object’s surface. Conversely, if the surface is totally specular (e.g., mirrored), 
each surface point reflects a miniature copy of the environment surrounding that point. In the 
absence of inter-reflections (e.g., a convex object in a large open space), each surface point 
simply reflects the far-field environment map (Section 2.2.1), which again is two-dimensional. 
Therefore, it seems that re-parameterizing the 4D light field to lie on the object’s surface can
be extremely beneficial. 

These observations underlie the surface light field representation introduced by Wood, 
Azuma, Aldinger et al. (2000). In their system, an accurate 3D model is built of the object 
being represented. Then the Lumisphere of all rays emanating from each surface point is 
estimated or captured (Figure 13.9). Nearby Lumispheres will be highly correlated and hence 
amenable to both compression and manipulation. 

To estimate the diffuse component of each Lumisphere, a median filtering over all visible 
exiting directions is first performed for each channel. Once this has been subtracted from the 
Lumisphere, the remaining values, which should consist mostly of the specular components, 
are reflected around the local surface normal (2.89), which turns each Lumisphere into a copy 
of the local environment around that point. Nearby Lumispheres can then be compressed 
using predictive coding, vector quantization, or principal component analysis.

The decomposition into diffuse and specular components can also be used to perform
editing or manipulation operations, such as re-painting the surface, changing the specular 
component of the reflection (e.g., by blurring or sharpening the specular Lumispheres), or 
even geometrically deforming the object while preserving detailed surface appearance. 
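
The diffuse and specular split described above can be sketched for a single Lumisphere. The code assumes the Lumisphere is given as N unit exiting directions with N RGB samples; the function name `split_lumisphere` and the array layout are assumptions of this example, and the reflection uses the standard mirror formula r = 2 (n · v) n - v.

```python
import numpy as np


def split_lumisphere(directions, colors, normal):
    """Separate one Lumisphere into diffuse and specular parts.

    directions -- (N, 3) unit exiting directions
    colors     -- (N, 3) RGB samples observed along those directions
    normal     -- (3,) surface normal at this point
    Returns (diffuse RGB, specular residuals, directions mirrored about the normal).
    """
    directions = np.asarray(directions, dtype=float)
    colors = np.asarray(colors, dtype=float)
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)

    diffuse = np.median(colors, axis=0)          # per-channel median over directions
    specular = colors - diffuse                  # what remains is mostly specular
    reflected = 2.0 * (directions @ n)[:, None] * n - directions
    return diffuse, specular, reflected


dirs = np.array([[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]], dtype=float)
cols = np.array([[0.2, 0.3, 0.4]] * 4 + [[1.0, 1.0, 1.0]])   # one bright highlight
diffuse, specular, reflected = split_lumisphere(dirs, cols, normal=[0, 0, 1])
print(diffuse, reflected[1])
```

The median is robust to the minority of directions that see a highlight, which is why it serves as a reasonable estimate of the diffuse color here.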

13.3.3 Application: Concentric mosaics

A useful and simple version of light field rendering is a panoramic image with parallax, i.e., a 
video or series of photographs taken from a camera swinging in front of some rotation point. 
Such panoramas can be captured by placing a camera on a boom on a tripod, or even more 
simply, by holding a camera at arm’s length while rotating your body around a fixed axis. 

The resulting set of images can be thought of as a concentric mosaic (Shum and He 1999; 
Shum, Wang, Chai et al. 2002) or a layered depth panorama (Zheng, Kang, Cohen et al. 
2007). The term “concentric mosaic” comes from a particular structure that can be used to 
re-bin all of the sampled rays, essentially associating each column of pixels with the “radius” 
of the concentric circle to which it is tangent (Shum and He 1999; Peleg, Ben-Ezra, and Pritch 
2001). 

Rendering from such data structures is fast and straightforward. If we assume that the 
scene is far enough away, for any virtual camera location, we can associate each column of 
pixels in the virtual camera with the nearest column of pixels in the input image set. (For 
a regularly captured set of images, this computation can be performed analytically.) If we 
have some rough knowledge of the depth of such pixels, columns can be stretched vertically 
to compensate for the change in depth between the two cameras. If we have an even more 
detailed depth map (Peleg, Ben-Ezra, and Pritch 2001; Li, Shum, Tang et al. 2004; Zheng, 
Kang, Cohen et al. 2007), we can perform pixel-by-pixel depth corrections. 
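
The column lookup can be sketched as follows, under the stated far-scene assumption and restricting attention to rays in the capture plane. A ray is indexed by its heading and by the signed radius of the concentric circle to which it is tangent; the function names and the sign convention are assumptions of this example.

```python
import math


def ray_to_mosaic_index(px, py, heading):
    """Index a planar ray by (signed tangent radius, heading in [0, 2 pi)).

    The ray starts at (px, py), measured from the rotation center, and
    points along `heading` (radians). Its distance to the center is the
    2D cross product of the start point and the unit direction.
    """
    dx, dy = math.cos(heading), math.sin(heading)
    radius = px * dy - py * dx
    return radius, heading % (2.0 * math.pi)


def nearest_column(radius, sampled_radii):
    """Index of the captured column whose tangent radius is closest."""
    return min(range(len(sampled_radii)), key=lambda i: abs(sampled_radii[i] - radius))


r, theta = ray_to_mosaic_index(1.0, 0.0, math.pi / 2)
print(r, theta, nearest_column(r, [0.0, 0.5, 1.0, 1.5]))
```

A ray that passes through the rotation center has radius zero, and rays with a radius larger than the capture ring cannot be rendered, which matches the constraint on virtual camera motion.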

While the virtual camera’s motion is constrained to lie in the plane of the original cameras 
and within the radius of the original capture ring, the resulting experience can exhibit complex 
rendering phenomena, such as reflections and translucencies, which cannot be captured using 
a texture-mapped 3D model of the world. Exercise 13.10 has you construct a concentric 
mosaic rendering system from a series of hand-held photos or video. 


13.4 Environment mattes 

So far in this chapter, we have dealt with view interpolation and light fields, which are tech- 
niques for modeling and rendering complex static scenes seen from different viewpoints. 

What if instead of moving around a virtual camera, we take a complex, refractive object, 
such as the water goblet shown in Figure 13.10, and place it in front of a new background?

Figure 13.10 Environment mattes: (a-b) a refractive object can be placed in front of a series
of backgrounds and their light patterns will be correctly refracted (Zongker, Werner, Cur-
less et al. 1999); (c) multiple refractions can be handled using a mixture of Gaussians model;
and (d) real-time mattes can be pulled using a single graded colored background (Chuang,
Zongker, Hindorff et al. 2000) © 2000 ACM.

Instead of modeling the 4D space of rays emanating from a scene, we now need to model 
how each pixel in our view of this object refracts incident light coming from its environment. 

What is the intrinsic dimensionality of such a representation and how do we go about 
capturing it? Let us assume that if we trace a light ray from the camera at pixel (x, y) toward 
the object, it is reflected or refracted back out toward its environment at an angle $(\phi, \theta)$. If
we assume that other objects and illuminants are sufficiently distant (the same assumption we
made for surface light fields in Section 13.3.2), this 4D mapping $(x, y) \rightarrow (\phi, \theta)$ captures all
the information between a refractive object and its environment. Zongker, Werner, Curless et
al. (1999) call such a representation an environment matte, since it generalizes the process of
object matting (Section 10.4) to not only cut and paste an object from one image into another 
but also take into account the subtle refractive or reflective interplay between the object and 
its environment. 

Recall from Equations (3.8) and (10.30) that a foreground object can be represented by
its premultiplied colors and opacities $(\alpha F, \alpha)$. Such a matte can then be composited onto a
new background B using

    $C_i = \alpha_i F_i + (1 - \alpha_i) B_i,$    (13.1)

where i is the pixel under consideration. In environment matting, we augment this equation
with a reflective or refractive term to model indirect light paths between the environment
and the camera. In the original work of Zongker, Werner, Curless et al. (1999), this indirect
component $I_i$ is modeled as

    $I_i = R_i \int A_i(x) B(x) \, dx,$    (13.2)

where $A_i$ is the rectangular area of support for that pixel, $R_i$ is the colored reflectance or
transmittance (for colored glossy surfaces or glass), and B(x) is the background (environ-
ment) image, which is integrated over the area $A_i(x)$. In follow-on work, Chuang, Zongker,
Hindorff et al. (2000) use a superposition of oriented Gaussians,

    $I_i = \sum_j R_{ij} \int G_{ij}(x) B(x) \, dx,$    (13.3)

where each 2D Gaussian

    $G_{ij}(x) = G_{2D}(x; c_{ij}, \sigma_{ij}, \theta_{ij})$    (13.4)

is modeled by its center $c_{ij}$, unrotated widths $\sigma_{ij} = (\sigma^x_{ij}, \sigma^y_{ij})$, and orientation $\theta_{ij}$.

Given a representation for an environment matte, how can we go about estimating it for a 
particular object? The trick is to place the object in front of a monitor (or surrounded by a set 
of monitors), where we can change the illumination patterns B(x) and observe the value of
each composite pixel $C_i$. 7

As with traditional two-screen matting (Section 10.4.1), we can use a variety of solid 
colored backgrounds to estimate each pixel’s foreground color $\alpha_i F_i$ and partial coverage
(opacity) $\alpha_i$. To estimate the area of support $A_i$ in (13.2), Zongker, Werner, Curless et al.
(1999) use a series of periodic horizontal and vertical solid stripes at different frequencies and
phases, which is reminiscent of the structured light patterns used in active rangefinding (Sec-
tion 12.2). For the more sophisticated mixture of Gaussians model (13.3), Chuang, Zongker,
Hindorff et al. (2000) sweep a series of narrow Gaussian stripes at four different orientations 
(horizontal, vertical, and two diagonals), which enables them to estimate multiple oriented 
Gaussian responses at each pixel. 

Once an environment matte has been “pulled”, it is then a simple matter to replace the 
background with a new image B(x) to obtain a novel composite of the object placed in a
different environment (Figure 13.10a-c). The use of multiple backgrounds during the matting
process, however, precludes the use of this technique with dynamic scenes, e.g., water pouring
into a glass (Figure 13.10d). In this case, a single graded color background can be used to
estimate a single 2D monochromatic displacement for each pixel (Chuang, Zongker, Hindorff 
et al. 2000). 
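
Re-compositing with a pulled matte can be sketched for one pixel under the rectangular-support model: the usual matting terms plus a reflectance-weighted contribution from the new background over the pixel's area of support. Treating the area integral as an average over the rectangle, and the names `composite_pixel` and `rect`, are assumptions made for this example.

```python
import numpy as np


def composite_pixel(alpha_F, alpha, R, rect, background, y, x):
    """Composite one environment-matte pixel over a new background image.

    alpha_F    -- premultiplied foreground color, shape (3,)
    alpha      -- coverage (opacity) of the pixel
    R          -- colored reflectance or transmittance, shape (3,)
    rect       -- (y0, y1, x0, x1) area of support in the background image
    background -- new background image, shape (H, W, 3)
    """
    y0, y1, x0, x1 = rect
    # Indirect term: the background averaged over the rectangular support.
    indirect = np.asarray(R) * background[y0:y1, x0:x1].mean(axis=(0, 1))
    return np.asarray(alpha_F) + (1.0 - alpha) * background[y, x] + indirect


B = np.full((4, 4, 3), 0.5)
glass = composite_pixel(np.zeros(3), 1.0, np.ones(3), (0, 2, 0, 4), B, 1, 1)
empty = composite_pixel(np.zeros(3), 0.0, np.zeros(3), (0, 1, 0, 1), B, 1, 1)
print(glass, empty)
```

A pixel with zero coverage and zero reflectance simply shows the background, whereas a fully covered, perfectly transmissive pixel shows the background as seen through its (possibly displaced and magnified) area of support.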

13.4.1 Higher-dimensional light fields 

As you can tell from the preceding discussion, an environment matte in principle maps every 
pixel (x, y) into a 4D distribution over light rays and is, hence, a six-dimensional representa-
tion. (In practice, each 2D pixel’s response is parameterized using a dozen or so parameters,
e.g., $\{F, \alpha, B, R, A\}$, instead of a full mapping.) What if we want to model an object’s re-
fractive properties from every potential point of view? In this case, we need a mapping from
every incoming 4D light ray to every potential exiting 4D light ray, which is an 8D represen-
tation. If we use the same trick as with surface light fields, we can parameterize each surface
point by its 4D BRDF to reduce this mapping back down to 6D, but this loses the ability to
handle multiple refractive paths.

7 If we relax the assumption that the environment is distant, the monitor can be placed at several depths to estimate
a depth-dependent mapping function (Zongker, Werner, Curless et al. 1999).

Figure 13.11 The geometry-image continuum in image-based rendering (Kang, Szeliski,
and Anandan 2000) © 2000 IEEE. Representations at the left of the spectrum use more
detailed geometry and simpler image representations, while representations and algorithms
on the right use more images and less geometry.

If we want to handle dynamic light fields, we need to add another temporal dimension. 
(Wenger, Gardner, Tchou et al. (2005) give a nice example of a dynamic appearance and
illumination acquisition system.) Similarly, if we want a continuous distribution over wave- 
lengths, this becomes another dimension. 

These examples illustrate how modeling the full complexity of a visual scene through 
sampling can be extremely expensive. Fortunately, constructing specialized models, which 
exploit knowledge about the physics of light transport along with the natural coherence of 
real-world objects, can make these problems more tractable. 


13.4.2 The modeling to rendering continuum 

The image-based rendering representations and algorithms we have studied in this chapter 
span a continuum ranging from classic 3D texture-mapped models all the way to pure sampled 
ray-based representations such as light fields (Figure 13.11). Representations such as view- 
dependent texture maps and Lumigraphs still use a single global geometric model, but select 
the colors to map onto these surfaces from nearby images. View-dependent geometry, e.g.,
multiple depth maps, sidesteps the need for coherent 3D geometry, and can sometimes better
model local non-rigid effects such as specular motion (Swaminathan, Kang, Szeliski et al. 
2002; Criminisi, Kang, Swaminathan et al. 2005). Sprites with depth and layered depth 
images use image-based representations of both color and geometry and can be efficiently 
rendered using warping operations rather than 3D geometric rasterization. 

The best choice of representation and rendering algorithm depends on both the quantity 
and quality of the input imagery as well as the intended application. When nearby views are 
being rendered, image-based representations capture more of the visual fidelity of the real 
world because they directly sample its appearance. On the other hand, if only a few input 
images are available or the image-based models need to be manipulated, e.g., to change their 
shape or appearance, more abstract 3D representations such as geometric and local reflection 
models are a better fit. As we continue to capture and manipulate increasingly larger quan- 
tities of visual data, research into these aspects of image-based modeling and rendering will 
continue to evolve. 


13.5 Video-based rendering 

Since multiple images can be used to render new images or interactive experiences, can some- 
thing similar be done with video? In fact, a fair amount of work has been done in the area 
of video-based rendering and video-based animation, two terms first introduced by Schödl,
Szeliski, Salesin et al. (2000) to denote the process of generating new video sequences from
captured video footage. An early example of such work is Video Rewrite (Bregler, Covell,
and Slaney 1997), in which archival video footage is “re-animated” by having actors say new 
utterances (Figure 13.12). More recently, the term video-based rendering has been used by 
some researchers to denote the creation of virtual camera moves from a set of synchronized 
video cameras placed in a studio (Magnor 2005). (The terms free-viewpoint video and 3D 
video are also sometimes used, see Section 13.5.4.) 

In this section, we present a number of video-based rendering systems and applications. 
We start with video-based animation (Section 13.5.1), in which video footage is re-arranged 
or modified, e.g., in the capture and re-rendering of facial expressions. A special case of this 
are video textures (Section 13.5.2), in which source video is automatically cut into segments 
and re-looped to create infinitely long video animations. It is also possible to create such 
animations from still pictures or paintings, by segmenting the image into separately moving 
regions and animating them using stochastic motion fields (Section 13.5.3). 

Next, we turn our attention to 3D video (Section 13.5.4), in which multiple synchronized 
video cameras are used to film a scene from different directions. The source video frames can 
then be re-combined using image-based rendering techniques, such as view interpolation, to
create virtual camera paths between the source cameras as part of a real-time viewing expe-
rience. Finally, we discuss capturing environments by driving or walking through them with
panoramic video cameras in order to create interactive video-based walkthrough experiences
(Section 13.5.5).

Figure 13.12 Video Rewrite (Bregler, Covell, and Slaney 1997) © 1997 ACM: the video
frames are composed from bits and pieces of old video footage matched to a new audio track.

13.5.1 Video-based animation 

As we mentioned above, an early example of video-based animation is Video Rewrite, in 
which frames from original video footage are rearranged in order to match them to novel 
spoken utterances, e.g., for movie dubbing (Figure 13.12). This is similar in spirit to the way 
that concatenative speech synthesis systems work (Taylor 2009). 

In their system, Bregler, Covell, and Slaney (1997) first use speech recognition to extract
phonemes from both the source video material and the novel audio stream. Phonemes are 
grouped into triphones (triplets of phonemes), since these better model the coarticulation 
effect present when people speak. Matching triphones are then found in the source footage 
and audio track. The mouth images corresponding to the selected video frames are then 
cut and pasted into the desired video footage being re-animated or dubbed, with appropriate 
geometric transformations to account for head motion. During the analysis phase, features 
corresponding to the lips, chin, and head are tracked using computer vision techniques. Dur- 
ing synthesis, image morphing techniques are used to blend and stitch adjacent mouth shapes 
into a more coherent whole. In more recent work, Ezzat, Geiger, and Poggio (2002) describe 
how to use a multidimensional morphable model (Section 12.6.2) combined with regularized
trajectory synthesis to improve these results. 
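
The triphone lookup at the heart of this pipeline can be illustrated with a minimal sketch. It assumes that phoneme labels have already been produced by a speech recognizer and only performs exact triphone lookups; the function names and the toy phoneme strings are hypothetical, and a real system would also score approximate matches and track the mouth region.

```python
from collections import defaultdict

def triphones(phonemes):
    """Return the overlapping triplets of a phoneme sequence."""
    return [tuple(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]

def build_index(source_phonemes):
    """Map each triphone to the positions where it starts in the source track."""
    index = defaultdict(list)
    for pos, tri in enumerate(triphones(source_phonemes)):
        index[tri].append(pos)
    return index

def match_triphones(source_phonemes, target_phonemes):
    """For each target triphone, list the source positions holding the same triphone."""
    index = build_index(source_phonemes)
    return [(tri, index.get(tri, [])) for tri in triphones(target_phonemes)]

# Toy example: phoneme labels are made up for illustration.
source = ["sil", "h", "eh", "l", "ow", "sil", "y", "eh", "l", "ow"]
target = ["y", "eh", "l", "ow"]
matches = match_triphones(source, target)
```

Each matched source position identifies a short run of video frames whose mouth images are candidates for pasting into the new footage.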

A more sophisticated version of this system, called face transfer, uses a novel source 
video, instead of just an audio track, to drive the animation of a previously captured video, i.e., 
to re-render a video of a talking head with the appropriate visual speech, expression, and head 
pose elements (Vlasic, Brand, Pfister et al. 2005). This work is one of many performance-driven
animation systems (Section 4.1.5), which are often used to animate 3D facial models
(Figures 12.18-12.19). While traditional performance-driven animation systems use marker-based
motion capture (Williams 1990; Litwinowicz and Williams 1994; Ma, Jones, Chiang
et al. 2008), video footage can now often be used directly to control the animation (Buck, 
Finkelstein, Jacobs et al. 2000; Pighin, Szeliski, and Salesin 2002; Zhang, Snavely, Curless 
et al. 2004; Vlasic, Brand, Pfister et al. 2005; Roble and Zafar 2009). 

In addition to its most common application to facial animation, video-based animation 
can also be applied to whole body motion (Section 12.6.4), e.g., by matching the flow fields 
between two different source videos and using one to drive the other (Efros, Berg, Mori et al. 
2003). Another approach to video-based rendering is to use flow or 3D modeling to unwrap 
surface textures into stabilized images, which can then be manipulated and re-rendered onto 
the original video (Pighin, Szeliski, and Salesin 2002; Rav-Acha, Kohli, Fitzgibbon et al. 
2008). 

13.5.2 Video textures 

Video-based animation is a powerful means of creating photo-realistic videos by re-purposing 
existing video footage to match some other desired activity or script. What if instead of 
constructing a special animation or narrative, we simply want the video to continue playing 
in a plausible manner? For example, many Web sites use images or videos to highlight their 
destinations, e.g., to portray attractive beaches with surf and palm trees waving in the wind. 
Instead of using a static image or a video clip that has a discontinuity when it loops, can we 
transform the video clip into an infinite-length animation that plays forever? 

This idea is the basis of video textures, in which a short video clip can be arbitrarily
extended by re-arranging video frames while preserving visual continuity (Schödl, Szeliski,
Salesin et al. 2000). The basic problem in creating video textures is how to perform this 
re-arrangement without introducing visual artifacts. Can you think of how you might do this? 

The simplest approach is to match frames by visual similarity (e.g., L2 distance) and to
jump between frames that appear similar. Unfortunately, if the motions in the two frames 
are different, a dramatic visual artifact will occur (the video will appear to “stutter”). For 
example, if we fail to match the motions of the clock pendulum in Figure 13.13a, it can 
suddenly change direction in mid-swing. 

How can we extend our basic frame matching to also match motion? In principle, we 
could compute optic flow at each frame and match this. However, flow estimates are often 
unreliable (especially in textureless regions) and it is not clear how to weight the visual and 
motion similarities relative to each other. As an alternative, Schödl, Szeliski, Salesin et al.
(2000) suggest matching triplets or larger neighborhoods of adjacent video frames, much
in the same way as Video Rewrite matches triphones. Once we have constructed an n × n
similarity matrix between all video frames (where n is the number of frames), a simple
finite impulse response (FIR) filtering of each match sequence can be used to emphasize
subsequences that match well.

Figure 13.13 Video textures (Schödl, Szeliski, Salesin et al. 2000) © 2000 ACM: (a) a
clock pendulum, with correctly matched direction of motion; (b) a candle flame, showing
temporal transition arcs; (c) the flag is generated using morphing at jumps; (d) a bonfire
uses longer cross-dissolves; (e) a waterfall cross-dissolves several sequences at once; (f) a
smiling animated face; (g) two swinging children are animated separately; (h) the balloons
are automatically segmented into separate moving regions; (i) a synthetic fish tank consisting
of bubbles, plants, and fish. Videos corresponding to these images can be found at
http://www.cc.gatech.edu/gvu/perception/projects/videotexture/.

The result of this match computation gives us a jump table or, equivalently, a transition
probability between any two frames in the original video. This is shown schematically as 
red arcs in Figure 13.13b, where the red bar indicates which video frame is currently be- 
ing displayed, and arcs light up as a forward or backward transition is taken. We can view 
these transition probabilities as encoding the hidden Markov model (HMM) that underlies a 
stochastic video generation process. 
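
The match computation described above can be sketched in a few lines of NumPy. This is a simplified illustration rather than the published implementation: the exponential mapping from filtered distance to probability, the parameter `sigma`, the uniform filter taps, and the one-pixel toy "pendulum" are all assumptions made for the example.

```python
import numpy as np

def frame_distances(frames):
    """Return the n x n matrix of L2 distances between all pairs of frames."""
    flat = frames.reshape(len(frames), -1).astype(np.float64)
    sq = (flat ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (flat @ flat.T)
    return np.sqrt(np.maximum(d2, 0.0))  # clamp tiny negative round-off

def filter_diagonals(dist, weights):
    """FIR-filter the distance matrix along its diagonals.

    out[a, b] = sum_k weights[k] * dist[a + k, b + k], so out[a, b] scores the
    match between the short subsequences around frames a + m and b + m,
    where m = len(weights) // 2 is returned as the index offset.
    """
    taps = len(weights)
    valid = dist.shape[0] - taps + 1
    out = np.zeros((valid, valid))
    for k, w in enumerate(weights):
        out += w * dist[k:k + valid, k:k + valid]
    return out, taps // 2

def transition_probabilities(filtered, sigma):
    """Row i holds the probabilities of jumping from (filtered) frame i to frame j."""
    cost = filtered[1:, :]  # a jump i -> j is seamless when j resembles frame i + 1
    prob = np.exp(-cost / sigma)
    return prob / prob.sum(axis=1, keepdims=True)

# Toy "pendulum": one-pixel frames whose brightness follows a sine wave.
t = np.arange(32)
frames = np.sin(2.0 * np.pi * t / 8.0).reshape(-1, 1, 1)
dist = frame_distances(frames)
filtered, m = filter_diagonals(dist, np.full(4, 0.25))
probs = transition_probabilities(filtered, sigma=0.1)
```

In this toy sequence, frames 5 and 7 look identical but move in opposite directions, so their raw distance is near zero while their filtered distance is large; frames 5 and 13 are one full period apart and match under both measures.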

Sometimes, it is not possible to find exactly matching subsequences in the original video. 
In this case, morphing, i.e., warping and blending frames during transitions (Section 3.6.3),
can be used to hide the visual differences (Figure 13.13c). If the motion is chaotic enough,
as in a bonfire or a waterfall (Figures 13.13d-e), simple blending (extended cross-dissolves)
may be sufficient. Improved transitions can also be obtained by performing 3D graph cuts on
the spatio-temporal volume around a transition (Kwatra, Schödl, Essa et al. 2003).

Video textures need not be restricted to chaotic random phenomena such as fire, wind, 
and water. Pleasing video textures can be created of people, e.g., a smiling face (as in
Figure 13.13f) or someone running on a treadmill (Schödl, Szeliski, Salesin et al. 2000). When
multiple people or objects are moving independently, as in Figures 13.13g-h, we must first
segment the video into independently moving regions and animate each region separately. 
It is also possible to create large panoramic video textures from a slowly panning camera 
(Agarwala, Zheng, Pal et al. 2005). 

Instead of just playing back the original frames in a stochastic (random) manner, video 
textures can also be used to create scripted or interactive animations. If we extract individual 
elements, such as fish in a fishtank (Figure 13.13i), into separate video sprites, we can animate
them along pre-specified paths (by matching the path direction with the original sprite motion)
to make our video elements move in a desired fashion (Schödl and Essa 2002). In fact, work
on video textures inspired research on systems that re-synthesize new motion sequences from 
motion capture data, which some people refer to as “mocap soup” (Arikan and Forsyth 2002; 
Kovar, Gleicher, and Pighin 2002; Lee, Chai, Reitsma et al. 2002; Li, Wang, and Shum 2002; 
Pullen and Bregler 2002). 

While video textures primarily analyze the video as a sequence of frames (or regions) that 
can be re-arranged in time, temporal textures (Szummer and Picard 1996; Bar-Joseph, El- 
Yaniv, Lischinski et al. 2001) and dynamic textures (Doretto, Chiuso, Wu et al. 2003; Yuan, 
Wen, Liu et al. 2004; Doretto and Soatto 2006) treat the video as a 3D spatio-temporal volume 
with textural properties, which can be described using auto-regressive temporal models.

Figure 13.14 Animating still pictures (Chuang, Goldman, Zheng et al. 2005) © 2005 ACM.
(a) The input still image is manually segmented into (b) several layers. (c) Each layer is
then animated with a different stochastic motion texture. (d) The animated layers are then
composited to produce (e) the final animation.
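
The auto-regressive view of dynamic textures mentioned at the end of Section 13.5.2 can be illustrated with a small sketch. It assumes a first-order linear model, x[t+1] = A x[t] + noise and y[t] = mean + C x[t], fitted with an SVD and least squares; the function names and parameters are hypothetical, and the cited papers use more elaborate estimation procedures.

```python
import numpy as np

def fit_dynamic_texture(frames, n_states):
    """Fit a first-order auto-regressive model to a (time, height, width) video."""
    n = len(frames)
    Y = frames.reshape(n, -1).astype(np.float64).T   # pixels x time
    mean = Y.mean(axis=1, keepdims=True)
    U, S, Vt = np.linalg.svd(Y - mean, full_matrices=False)
    C = U[:, :n_states]                               # appearance basis
    X = S[:n_states, None] * Vt[:n_states]            # hidden states over time
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])          # least-squares x[t+1] ~ A x[t]
    return mean, C, A, X

def synthesize(mean, C, A, x0, n_frames, noise_std=0.0, seed=0):
    """Run the model forward from state x0, returning an (n_frames, pixels) array."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=np.float64)
    out = []
    for _ in range(n_frames):
        out.append(mean[:, 0] + C @ x)
        x = A @ x + noise_std * rng.standard_normal(x.shape)
    return np.array(out)
```

With `noise_std=0` the model replays a deterministic approximation of the input; a small positive value produces new, non-repeating variations with similar temporal statistics.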

13.5.3 Application: Animating pictures

While video textures can turn a short video clip into an infinitely long video, can the same 
thing be done with a single still image? The answer is yes, if you are willing to first segment 
the image into different layers and then animate each layer separately. 

Chuang, Goldman, Zheng et al. (2005) describe how an image can be decomposed into
separate layers using interactive matting techniques. Each layer is then animated using a 
class-specific synthetic motion. As shown in Figure 13.14, boats rock back and forth, trees 
sway in the wind, clouds move horizontally, and water ripples, using a shaped noise displace- 
ment map. All of these effects can be tied to some global control parameters, such as the 
velocity and direction of a virtual wind. After being individually animated, the layers can be 
composited to create a final dynamic rendering. 
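
The layer-based pipeline can be sketched as follows. This is a deliberately crude stand-in for the published system: each layer is a floating-point RGBA image, the "motion texture" is a single sinusoidal horizontal shift scaled by a global wind speed, and the helper names and motion parameters are hypothetical. Real systems use per-pixel displacement maps and class-specific motion models.

```python
import numpy as np

def sway_offset(t, wind_speed, amplitude, frequency, phase=0.0):
    """Horizontal displacement (in pixels) of a layer at time t (toy motion model)."""
    return wind_speed * amplitude * np.sin(2.0 * np.pi * frequency * t + phase)

def shift_layer(rgba, dx):
    """Shift an RGBA layer horizontally by a whole number of pixels, padding with transparency."""
    dx = int(round(dx))
    out = np.zeros_like(rgba)
    if dx > 0:
        out[:, dx:] = rgba[:, :-dx]
    elif dx < 0:
        out[:, :dx] = rgba[:, -dx:]
    else:
        out[:] = rgba
    return out

def composite(layers):
    """Composite non-premultiplied RGBA layers back to front with the over operator."""
    out = np.zeros(layers[0].shape[:2] + (3,))
    for rgba in layers:
        alpha = rgba[..., 3:4]
        out = alpha * rgba[..., :3] + (1.0 - alpha) * out
    return out

def render_frame(layers, motions, t, wind_speed):
    """Animate each layer with its own motion parameters, then composite the result."""
    moved = [shift_layer(layer, sway_offset(t, wind_speed, **motion))
             for layer, motion in zip(layers, motions)]
    return composite(moved)
```

Tying every layer to the same `wind_speed` argument mirrors the idea of a global control parameter: changing one number changes the motion of all layers consistently.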


13.5.4 3D Video 

In recent years, the popularity of 3D movies has grown dramatically, with recent releases
ranging from Hannah Montana, through U2’s 3D concert movie, to James Cameron’s Avatar.
Currently, such releases are filmed using stereoscopic camera rigs and displayed in theaters
(or at home) to viewers wearing polarized glasses. 8 In the future, however, home audiences
may wish to view such movies with multi-zone auto-stereoscopic displays, where each person
gets his or her own customized stereo stream and can move around a scene to see it from
different perspectives. 9

8 http://www.3d-summit.com/.

Figure 13.15 Video view interpolation (Zitnick, Kang, Uyttendaele et al. 2004) © 2004
ACM: (a) the capture hardware consists of eight synchronized cameras; (b) the background
and foreground images from each camera are rendered and composited before blending; (c)
the two-layer representation, before and after boundary matting; (d) background color
estimates; (e) background depth estimates; (f) foreground color estimates.


The stereo matching techniques developed in the computer vision community along with 
image-based rendering (view interpolation) techniques from graphics are both essential com- 
ponents in such scenarios, which are sometimes called free-viewpoint video (Carranza, Theobalt, 
Magnor et al. 2003) or virtual viewpoint video (Zitnick, Kang, Uyttendaele et al. 2004). In
addition to solving a series of per-frame reconstruction and view interpolation problems, the 
depth maps or proxies produced by the analysis phase must be temporally consistent in order 
to avoid flickering artifacts. 

Shum, Chan, and Kang (2007) and Magnor (2005) present nice overviews of various 
video view interpolation techniques and systems. These include the Virtualized Reality sys- 
tem of Kanade, Rander, and Narayanan (1997) and Vedula, Baker, and Kanade (2005), Im- 
mersive Video (Moezzi, Katkere, Kuramura et al. 1996), Image-Based Visual Hulls (Matusik, 
Buehler, Raskar et al. 2000; Matusik, Buehler, and McMillan 2001), and Free-Viewpoint 
Video (Carranza, Theobalt, Magnor et al. 2003), which all use global 3D geometric models 
(surface-based (Section 12.3) or volumetric (Section 12.5)) as their proxies for rendering. 
The work of Vedula, Baker, and Kanade (2005) also computes scene flow, i.e., the 3D motion 
between corresponding surface elements, which can then be used to perform spatio-temporal 
interpolation of the multi-view video stream. 

The Virtual Viewpoint Video system of Zitnick, Kang, Uyttendaele et al. (2004), on the
other hand, associates a two-layer depth map with each input image, which allows them to
accurately model occlusion effects such as the mixed pixels that occur at object boundaries.
Their system, which consists of eight synchronized video cameras connected to a disk array
(Figure 13.15a), first uses segmentation-based stereo to extract a depth map for each input
image (Figure 13.15e). Near object boundaries (depth discontinuities), the background layer
is extended along a strip behind the foreground object (Figure 13.15c) and its color is
estimated from the neighboring images where it is not occluded (Figure 13.15d). Automated
matting techniques (Section 10.4) are then used to estimate the fractional opacity and color
of boundary pixels in the foreground layer (Figure 13.15f).

9 http://www.siggraph.org/s2008/attendees/caf/3d/.

At render time, given a new virtual camera that lies between two of the original cameras, 
the layers in the neighboring cameras are rendered as texture-mapped triangles and the fore- 
ground layer (which may have fractional opacities) is then composited over the background 
layer (Figure 13.15b). The resulting two images are merged and blended by comparing their 
respective z-buffer values. (Whenever the two z-values are sufficiently close, a linear blend of 
the two colors is computed.) The interactive rendering system runs in real time using regular 
graphics hardware. It can therefore be used to change the observer’s viewpoint while playing 
the video or to freeze the scene and explore it in 3D. More recently, Rogmans, Lu, Bekaert 
et al. (2009) have developed GPU implementations of both real-time stereo matching and 
real-time rendering algorithms, which enable them to explore algorithmic alternatives in a 
real-time setting. 
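
The final merging step, in which the two rendered views are combined by comparing z-buffer values, can be sketched on the CPU as follows. The function name, the tolerance `z_tol`, and the convention that missing pixels carry infinite depth are assumptions made for this illustration; the actual system performs this step on graphics hardware.

```python
import numpy as np

def merge_views(color_a, z_a, color_b, z_b, weight_a, z_tol):
    """Merge two renderings of the same virtual view.

    color_a, color_b: (H, W, 3) images rendered from the two neighboring cameras.
    z_a, z_b:         (H, W) depth buffers, np.inf where a view has no data.
    weight_a:         blend weight in [0, 1], larger when the virtual camera is
                      closer to camera A.
    z_tol:            depths closer than this are treated as the same surface.
    """
    with np.errstate(invalid="ignore"):        # inf - inf yields nan, treated as "not close"
        close = np.abs(z_a - z_b) <= z_tol
    a_in_front = z_a < z_b
    blend = weight_a * color_a + (1.0 - weight_a) * color_b
    nearest = np.where(a_in_front[..., None], color_a, color_b)
    return np.where(close[..., None], blend, nearest)
```

Where the two depths agree, both cameras see the same surface and their colors are linearly blended; otherwise one view is occluded or missing at that pixel and the nearer surface wins.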

At present, the depth maps computed off-line from the eight stereo cameras are the
highest quality depth maps associated with live video. 10 They
are therefore often used in studies of 3D video compression, which is an active area of re- 
search (Smolic and Kauff 2005; Gotchev and Rosenhahn 2009). Active video-rate depth 
sensing cameras, such as the 3DV Zcam (Iddan and Yahav 2001), which we discussed in 
Section 12.2.1, are another potential source of such data. 

When large numbers of closely spaced cameras are available, as in the Stanford Light 
Field Camera (Wilburn, Joshi, Vaish et al. 2005), it may not always be necessary to compute 
explicit depth maps to create video-based rendering effects, although the results are usually 
of higher quality if you do (Vaish, Szeliski, Zitnick et al. 2006). 

13.5.5 Application: Video-based walkthroughs

Video camera arrays enable the simultaneous capture of 3D dynamic scenes from multiple 
viewpoints, which can then enable the viewer to explore the scene from viewpoints near the 
original capture locations. What if instead we wish to capture an extended area, such as a 
home, a movie set, or even an entire city? 

10 http://research.microsoft.com/en-us/um/redmond/groups/ivm/vvv/.

In this case, it makes more sense to move the camera through the environment and play
back the video as an interactive video-based walkthrough. In order to allow the viewer to 
look around in all directions, it is preferable to use a panoramic video camera (Uyttendaele, 
Criminisi, Kang et al. 2004). 11 

One way to structure the acquisition process is to capture these images in a 2D horizontal 
plane, e.g., over a grid superimposed inside a room. The resulting sea of images (Aliaga, 
Funkhouser, Yanovsky et al. 2003) can be used to enable continuous motion between the 
captured locations. 12 However, extending this idea to larger settings, e.g., beyond a single 
room, can become tedious and data-intensive. 

Instead, a natural way to explore a space is often to just walk through it along some pre- 
specified paths, just as museums or home tours guide users along a particular path, say down 
the middle of each room. 13 Similarly, city-level exploration can be achieved by driving down 
the middle of each street and allowing the user to branch at each intersection. This idea dates 
back to the Aspen MovieMap project (Lippman 1980), which recorded analog video taken 
from moving cars onto videodiscs for later interactive playback. 

Recent improvements in video technology now enable the capture of panoramic (spheri- 
cal) video using a small co-located array of cameras, such as the Point Grey Ladybug cam- 
era 14 (Figure 13.16b) developed by Uyttendaele, Criminisi, Kang et al. (2004) for their inter- 
active video-based walkthrough project. In their system, the synchronized video streams from 
the six cameras (Figure 13.16a) are stitched together into 360° panoramas using a variety of 
techniques developed specifically for this project. 

Because the cameras do not share the same center of projection, parallax between the 
cameras can lead to ghosting in the overlapping fields of view (Figure 13.16c). To remove 
this, a multi-perspective plane sweep stereo algorithm is used to estimate per-pixel depths at 
each column in the overlap area. To calibrate the cameras relative to each other, the camera 
is spun in place and a constrained structure from motion algorithm (Figure 7.8) is used to 
estimate the relative camera poses and intrinsics. Feature tracking is then run on the
walkthrough video in order to stabilize the video sequence; Liu, Gleicher, Jin et al. (2009) have
carried out more recent work along these lines.

Indoor environments with windows, as well as sunny outdoor environments with strong 
shadows, often have a dynamic range that exceeds the capabilities of video sensors. For 
this reason, the Ladybug camera has a programmable exposure capability that enables the 
bracketing of exposures at subsequent video frames. In order to merge the resulting video
frames into high dynamic range (HDR) video, pixels from adjacent frames need to be
motion-compensated before being merged (Kang, Uyttendaele, Winder et al. 2003).

11 See http://www.cis.upenn.edu/~kostas/omni.html for descriptions of panoramic (omnidirectional) vision
systems and associated workshops.

12 The Photo Tourism system of Snavely, Seitz, and Szeliski (2006) applies this idea to less structured collections.

13 In computer games, restricting a player to forward and backward motion along predetermined paths is called
rail-based gaming.

14 http://www.ptgrey.com/.

Figure 13.16 Video-based walkthroughs (Uyttendaele, Criminisi, Kang et al. 2004) © 2004
IEEE: (a) system diagram of video pre-processing; (b) the Point Grey Ladybug camera; (c)
ghost removal using multi-perspective plane sweep; (d) point tracking, used both for
calibration and stabilization; (e) interactive garden walkthrough with map below; (f) overhead map
authoring and sound placement; (g) interactive home walkthrough with navigation bar (top)
and icons of interest (bottom).

The interactive walk-through experience becomes much richer and more navigable if an 
overview map is available as part of the experience. In Figure 13.16f, the map has annotations,
which can show up during the tour, and localized sound sources, which play (with different 
volumes) when the viewer is nearby. The process of aligning the video sequence with the 
map can be automated using a process called map correlation (Levin and Szeliski 2004). 

All of these elements combine to provide the user with a rich, interactive, and immersive 
experience. Figure 13.16e shows a walk through the Bellevue Botanical Gardens, with an 
overview map in perspective below the live video window. Arrows on the ground are used to 
indicate potential directions of travel. The viewer simply orients his view towards one of the 
arrows (the experience can be driven using a game controller) and “walks” forward along the 
desired path. 

Figure 13.16g shows an indoor home tour experience. In addition to a schematic map
in the lower left corner and adjacent room names along the top navigation bar, icons appear 
along the bottom whenever items of interest, such as a homeowner’s art pieces, are visible 
in the main window. These icons can then be clicked to provide more information and 3D 
views. 

The development of interactive video tours spurred a renewed interest in 360° video-based 
virtual travel and mapping experiences, as evidenced by commercial sites such as Google’s 
Street View and Bing Maps. The same videos can also be used to generate turn-by-turn driving
directions, taking advantage of both expanded fields of view and image-based rendering
to enhance the experience (Chen, Neubert, Ofek et al. 2009). 

As we continue to capture more and more of our real world with large amounts of high- 
quality imagery and video, the interactive modeling, exploration, and rendering techniques 
described in this chapter will play an even bigger role in bringing virtual experiences based 
on remote areas of the world closer to everyone. 


13.6 Additional reading 

Two good recent surveys of image-based rendering are by Kang, Li, Tong et al. (2006) and 
Shum, Chan, and Kang (2007), with earlier surveys available from Kang (1999), McMillan 
and Gortler (1999), and Debevec (1999). The term image-based rendering was introduced by 
McMillan and Bishop (1995), although the seminal paper in the field is the view interpolation 
paper by Chen and Williams (1993). Debevec, Taylor, and Malik (1996) describe their Façade
system, which not only created a variety of image-based modeling tools but also introduced 
the widely used technique of view-dependent texture mapping.

Early work on planar impostors and layers was carried out by Shade, Lischinski, Salesin
et al. (1996), Lengyel and Snyder (1997), and Torborg and Kajiya (1996), while newer work 
based on sprites with depth is described by Shade, Gortler, He et al. (1998). 

The two foundational papers in image-based rendering are Light field rendering by Levoy 
and Hanrahan (1996) and The Lumigraph by Gortler, Grzeszczuk, Szeliski et al. (1996). 
Buehler, Bosse, McMillan et al. (2001) generalize the Lumigraph approach to irregularly 
spaced collections of images, while Levoy (2006) provides a survey and more gentle intro- 
duction to the topic of light field and image-based rendering. 

Surface light fields (Wood, Azuma, Aldinger et al. 2000) provide an alternative param- 
eterization for light fields with accurately known surface geometry and support both better 
compression and the possibility of editing surface properties. Concentric mosaics (Shum and 
He 1999; Shum, Wang, Chai et al. 2002) and panoramas with depth (Peleg, Ben-Ezra, and 
Pritch 2001; Li, Shum, Tang et al. 2004; Zheng, Kang, Cohen et al. 2007), provide useful 
parameterizations for light fields captured with panning cameras. Multi-perspective images 
(Rademacher and Bishop 1998) and manifold projections (Peleg and Herman 1997), although 
not true light fields, are also closely related to these ideas. 

Among the possible extensions of light fields to higher-dimensional structures, environ- 
ment mattes (Zongker, Werner, Curless et al. 1999; Chuang, Zongker, Hindorff et al. 2000) 
are the most useful, especially for placing captured objects into new scenes. 

Video-based rendering, i.e., the re-use of video to create new animations or virtual ex- 
periences, started with the seminal work of Szummer and Picard (1996), Bregler, Covell, 
and Slaney (1997), and Schödl, Szeliski, Salesin et al. (2000). Important follow-on work
to these basic re-targeting approaches was carried out by Schödl and Essa (2002), Kwatra,
Schödl, Essa et al. (2003), Doretto, Chiuso, Wu et al. (2003), Wang and Zhu (2003), Zhong
and Sclaroff (2003), Yuan, Wen, Liu et al. (2004), Doretto and Soatto (2006), Zhao and
Pietikäinen (2007), and Chan and Vasconcelos (2009).

Systems that allow users to change their 3D viewpoint based on multiple synchronized 
video streams include those by Moezzi, Katkere, Kuramura et al. (1996), Kanade, Ran- 
der, and Narayanan (1997), Matusik, Buehler, Raskar et al. (2000), Matusik, Buehler, and 
McMillan (2001), Carranza, Theobalt, Magnor et al. (2003), Zitnick, Kang, Uyttendaele et 
al. (2004), Magnor (2005), and Vedula, Baker, and Kanade (2005). 3D (multiview) video 
coding and compression is also an active area of research (Smolic and Kauff 2005; Gotchev 
and Rosenhahn 2009), with 3D Blu-Ray discs, encoded using the multiview video coding 
(MVC) extension to H.264/MPEG-4 AVC, expected by the end of 2010.


13.7 Exercises

Ex 13.1: Depth image rendering Develop a “view extrapolation” algorithm to re-render a 
previously computed stereo depth map coupled with its corresponding reference color image. 

1. Use a 3D graphics mesh rendering system such as OpenGL or Direct3D, with two 
triangles per pixel quad and perspective (projective) texture mapping (Debevec, Yu, 
and Borshukov 1998). 

2. Alternatively, use the one- or two-pass forward warper you constructed in Exercise 3.24, 
extended using (2.68-2.70) to convert from disparities or depths into displacements. 

3. (Optional) Kinks in straight lines introduced during view interpolation or extrapola- 
tion are visually noticeable, which is one reason why image morphing systems let you 
specify line correspondences (Beier and Neely 1992). Modify your depth estimation 
algorithm to match and estimate the geometry of straight lines and incorporate it into 
your image-based rendering algorithm. 

Ex 13.2: View interpolation Extend the system you created in the previous exercise to render
two reference views and then blend the images using a combination of z-buffering, hole
filling, and blending (morphing) to create the final image (Section 13.1).

1. (Optional) If the two source images have very different exposures, the hole-filled
regions and the blended regions will have different exposures. Can you extend your
algorithm to mitigate this? 

2. (Optional) Extend your algorithm to perform three-way (trilinear) interpolation be- 
tween neighboring views. You can triangulate the reference camera poses and use 
barycentric coordinates for the virtual camera in order to determine the blending weights 

Ex 13.3: View morphing Modify your view interpolation algorithm to perform morphs be- 
tween views of a non-rigid object, such as a person changing expressions. 

1. Instead of using a pure stereo algorithm, use a general flow algorithm to compute
displacements, but separate them into a rigid displacement due to camera motion and a
non-rigid deformation. 

2. At render time, use the rigid geometry to determine the new pixel location but then add 
a fraction of the non-rigid displacement as well. 

3. Alternatively, compute a stereo depth map but let the user specify additional
correspondences or use a feature-based matching algorithm to provide them automatically.

4. (Optional) Take a single image, such as the Mona Lisa or a friend’s picture, and create
an animated 3D view morph (Seitz and Dyer 1996). 

(a) Find the vertical axis of symmetry in the image and reflect your reference image 
to provide a virtual pair (assuming the person’s hairstyle is somewhat symmetric). 

(b) Use structure from motion to determine the relative camera pose of the pair. 

(c) Use dense stereo matching to estimate the 3D shape. 

(d) Use view morphing to create a 3D animation. 

Ex 13.4: View dependent texture mapping Use a 3D model you created along with the 
original images to implement a view-dependent texture mapping system. 

1. Use one of the 3D reconstruction techniques you developed in Exercises 7.3, 11.9,
11.10, or 12.8 to build a triangulated 3D image-based model from multiple photographs.

2. Extract textures for each model face from your photographs, either by performing the 
appropriate resampling or by figuring out how to use the texture mapping software to 
directly access the source images. 

3. At run time, for each new camera view, select the best source image for each visible 
model face. 

4. Extend this to blend between the top two or three textures. This is trickier, since it 
involves the use of texture blending or pixel shading (Debevec, Taylor, and Malik 1996; 
Debevec, Yu, and Borshukov 1998; Pighin, Hecker, Lischinski et al. 1998). 
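
As a starting point for items 3 and 4, the view selection can be sketched by ranking source cameras according to the angle between their viewing direction and that of the virtual camera, measured at the face center. The inverse-angle weighting used here is an assumption chosen for simplicity, not the scheme of the cited papers, and the function names are hypothetical.

```python
import numpy as np

def rank_source_views(face_center, virtual_cam, source_cams):
    """Order source cameras from most to least similar viewing direction."""
    def unit(v):
        return v / np.linalg.norm(v)
    face_center = np.asarray(face_center, dtype=np.float64)
    d_virtual = unit(np.asarray(virtual_cam, dtype=np.float64) - face_center)
    cosines = np.array([unit(np.asarray(c, dtype=np.float64) - face_center) @ d_virtual
                        for c in source_cams])
    order = np.argsort(-cosines)              # largest cosine = smallest angle first
    return order, cosines

def blend_weights(order, cosines, k=2):
    """Normalized weights for the k best views, inversely proportional to angle."""
    top = order[:k]
    angles = np.arccos(np.clip(cosines[top], -1.0, 1.0))
    w = 1.0 / (angles + 1e-6)                 # small constant avoids division by zero
    return top, w / w.sum()
```

A complete system would also test each face for visibility in each source image before ranking, since a camera with a good viewing angle may still be occluded.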

Ex 13.5: Layered depth images Extend your view interpolation algorithm (Exercise 13.2) 
to store more than one depth or color value per pixel (Shade, Gortler, He et al. 1998), i.e., a 
layered depth image (LDI). Modify your rendering algorithm accordingly. For your data, you 
can use synthetic ray tracing, a layered reconstructed model, or a volumetric reconstruction. 

Ex 13.6: Rendering from sprites or layers Extend your view interpolation algorithm to 
handle multiple planes or sprites (Section 13.2.1) (Shade, Gortler, He et al. 1998). 

1. Extract your layers using the technique you developed in Exercise 8.9. 

2. Alternatively, use an interactive painting and 3D placement system to extract your lay- 
ers (Kang 1998; Oh, Chen, Dorsey et al. 2001; Shum, Sun, Yamazaki et al. 2004). 

3. Determine a back-to-front order based on expected visibility or add a z-buffer to your
rendering algorithm to handle occlusions.

4. Render and composite all of the resulting layers, with optional alpha matting to handle
the edges of layers and sprites. 

Ex 13.7: Light field transformations Derive the equations relating regular images to 4D 
light field coordinates. 

1. Determine the mapping between the far plane (u, v) coordinates and a virtual camera’s
(x, y) coordinates.

(a) Start by parameterizing a 3D point on the uv plane in terms of its (u, v) coordinates.

(b) Project the resulting 3D point to the camera pixels (x, y, 1) using the usual 3 × 4
camera matrix P (2.63).

(c) Derive the 2D homography relating (u, v) and (x, y) coordinates.

2. Write down a similar transformation for (s, t) to (x, y) coordinates. 

3. Prove that if the virtual camera is actually on the (s, t) plane, the (s, t) value depends 
only on the camera’s optical center and is independent of (x, y). 

4. Prove that an image taken by a regular orthographic or perspective camera, i.e., one that 
has a linear projective relationship between 3D points and (x, y) pixels (2.63), samples 
the (s, t, u, v) light field along a two-dimensional hyperplane.

Ex 13.8: Light field and Lumigraph rendering Implement a light field or Lumigraph ren- 
dering system: 

1. Download one of the light field data sets from http://lightfield.stanford.edu/. 

2. Write an algorithm to synthesize a new view from this light field, using quadri-linear 
interpolation of (s, t, u, v) ray samples.

3. Try varying the focal plane corresponding to your desired view (Isaksen, McMillan, 
and Gortler 2000) and see if the resulting image looks sharper. 

4. Determine a 3D proxy for the objects in your scene. You can do this by running multi- 
view stereo over one of your light fields to obtain a depth map per image. 

5. Implement the Lumigraph rendering algorithm, which modifies the sampling of rays 
according to the 3D location of each surface element. 

6. Collect a set of images yourself and determine their pose using structure from motion. 

7. Implement the unstructured Lumigraph rendering algorithm from Buehler, Bosse, McMil- 
lan et al. (2001). 
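
The quadri-linear interpolation in step 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the light field is stored as a nested list `L[s][t][u][v]` of scalar samples on an integer grid with at least two samples per axis, and the caller has already converted the desired ray into fractional (s, t, u, v) grid coordinates (which is where the choice of focal plane in step 3 would enter). The function name is hypothetical.

```python
import math


def quadrilinear_sample(L, s, t, u, v):
    """Interpolate a 4D light field L[s][t][u][v] at fractional grid coordinates."""
    dims = (len(L), len(L[0]), len(L[0][0]), len(L[0][0][0]))
    base, frac = [], []
    for c, n in zip((s, t, u, v), dims):
        c = min(max(float(c), 0.0), n - 1.0)   # clamp to the sampled range
        i = min(int(math.floor(c)), n - 2)     # lower index of the enclosing cell
        base.append(i)
        frac.append(c - i)
    value = 0.0
    for corner in range(16):                   # the 2^4 neighboring ray samples
        weight = 1.0
        idx = []
        for axis in range(4):
            bit = (corner >> axis) & 1
            weight *= frac[axis] if bit else 1.0 - frac[axis]
            idx.append(base[axis] + bit)
        value += weight * L[idx[0]][idx[1]][idx[2]][idx[3]]
    return value
```

Because the weights are products of four 1D linear weights, the sketch reproduces any function that is linear in each coordinate exactly, which makes a convenient sanity check before moving to real data.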

Ex 13.9: Surface light fields Construct a surface light field (Wood, Azuma, Aldinger et al. 
2000) and see how well you can compress it. 

1. Acquire an interesting light field of a specular scene or object, or download one from 
http://lightfield.stanford.edu/. 

2. Build a 3D model of the object using a multi-view stereo algorithm that is robust to 
outliers due to specularities. 

3. Estimate the Lumisphere for each surface point on the object. 

4. Estimate its diffuse components. Is the median the best way to do this? Why not use 
the minimum color value? What happens if there is Lambertian shading on the diffuse 
component? 

5. Model and compress the remaining portion of the Lumisphere using one of the tech- 
niques suggested by Wood, Azuma, Aldinger et al. (2000) or invent one of your own. 

6. Study how well your compression algorithm works and what artifacts it produces. 

7. (Optional) Develop a system to edit and manipulate your surface light field. 

Ex 13.10: Handheld concentric mosaics Develop a system to navigate a handheld con- 
centric mosaic. 

1. Stand in the middle of a room with a camcorder held at arm’s length in front of you and 
spin in a circle. 

2. Use a structure from motion system to determine the camera pose and sparse 3D struc- 
ture for each input frame. 

3. (Optional) Re-bin your image pixels into a more regular concentric mosaic structure. 

4. At view time, determine from the new camera’s view (which should be near the plane 
of your original capture) which source pixels to display. You can simplify your com- 
putations to determine a source column (and scaling) for each output column. 

5. (Optional) Use your sparse 3D structure, interpolated to a dense depth map, to improve 
your rendering (Zheng, Kang, Cohen et al. 2007). 

Ex 13.11: Video textures Capture some videos of natural phenomena, such as a water 
fountain, fire, or smiling face, and loop the video seamlessly into an infinite length video 
(Schödl, Szeliski, Salesin et al. 2000). 

1. Compare all the frames in the original clip using an L2 (sum of squared differences) 
metric. (This assumes the videos were shot on a tripod or have already been stabilized.) 

2. Filter the comparison table temporally to accentuate temporal sub-sequences that match 
well together. 

3. Convert your similarity table into a jump probability table through some exponential 
distribution. Be sure to modify transitions near the end so you do not get “stuck” in the 
last frame. 

4. Starting with the first frame, use your transition table to decide whether to jump for- 
ward, backward, or continue to the next frame. 

5. (Optional) Add any of the other extensions to the original video textures idea, such 
as multiple moving regions, interactive control, or graph cut spatio-temporal texture 
seaming. 
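
Steps 1 to 4 of the video textures exercise can be sketched as below. This is a hedged illustration, not the published algorithm: frames are assumed to be flat lists of pixel values, the temporal filter is a plain box average along diagonals, the jump cost from frame i to frame j is taken to be the distance between frame i + 1 and frame j, and `sigma` is a free parameter. All function names are hypothetical.

```python
import math
import random


def ssd(a, b):
    """L2 (sum of squared differences) distance between two flattened frames."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def distance_table(frames):
    return [[ssd(fi, fj) for fj in frames] for fi in frames]


def filter_diagonal(D, radius=1):
    """Average D along its diagonals so that matching sub-sequences score well."""
    n = len(D)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            vals = [D[i + k][j + k] for k in range(-radius, radius + 1)
                    if 0 <= i + k < n and 0 <= j + k < n]
            out[i][j] = sum(vals) / len(vals)
    return out


def jump_probabilities(D, sigma):
    """P[i][j] is the probability of showing frame j right after frame i.

    A jump i -> j looks smooth when frame j resembles frame i + 1. The last
    frame has no successor, so it reuses its own distances but forbids
    staying put, which keeps playback from getting stuck (needs >= 2 frames).
    """
    n = len(D)
    P = []
    for i in range(n):
        if i + 1 < n:
            costs = list(D[i + 1])
        else:
            costs = [math.inf if j == i else D[i][j] for j in range(n)]
        cmin = min(costs)                       # subtract the minimum for stability
        weights = [math.exp(-(c - cmin) / sigma) for c in costs]
        total = sum(weights)
        P.append([w / total for w in weights])
    return P


def play(P, n_steps, seed=0):
    """Start at the first frame and sample a frame order from the jump table."""
    rng = random.Random(seed)
    frame, order = 0, [0]
    for _ in range(n_steps):
        frame = rng.choices(range(len(P)), weights=P[frame])[0]
        order.append(frame)
    return order
```

A small `sigma` makes playback follow the original frame order almost deterministically, while a large one produces more frequent (and more visible) jumps.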

Figure 14.1 Recognition: face recognition with (a) pictorial structures (Fischler and 
Elschlager 1973) © 1973 IEEE and (b) eigenfaces (Turk and Pentland 1991b); (c) real- 
time face detection (Viola and Jones 2004) © 2004 Springer; (d) instance (known object) 
recognition (Lowe 1999) © 1999 IEEE; (e) feature-based recognition (Fergus, Perona, and 
Zisserman 2007); (f) region-based recognition (Mori, Ren, Efros et al. 2004) © 2004 IEEE; 
(g) simultaneous recognition and segmentation (Shotton, Winn, Rother et al. 2009) © 2009 
Springer; (h) location recognition (Philbin, Chum, Isard et al. 2007) © 2007 IEEE; (i) using 
context (Russell, Torralba, Liu et al. 2007). 

Of all the visual tasks we might ask a computer to perform, analyzing a scene and recog- 
nizing all of the constituent objects remains the most challenging. While computers excel at 
accurately reconstructing the 3D shape of a scene from images taken from different views, 
they cannot name all the objects and animals present in a picture, even at the level of a two- 
year-old child. There is not even any consensus among researchers on when this level of 
performance might be achieved. 

Why is recognition so hard? The real world is made of a jumble of objects, which all oc- 
clude one another and appear in different poses. Furthermore, the variability intrinsic within 
a class (e.g., dogs), due to complex non-rigid articulation and extreme variations in shape and 
appearance (e.g., between different breeds), makes it unlikely that we can simply perform 
exhaustive matching against a database of exemplars. 1 

The recognition problem can be broken down along several axes. For example, if we 
know what we are looking for, the problem is one of object detection (Section 14.1), which 
involves quickly scanning an image to determine where a match may occur (Figure 14.1c). If 
we have a specific rigid object we are trying to recognize (instance recognition, Section 14.3), 
we can search for characteristic feature points (Section 4.1) and verify that they align in a 
geometrically plausible way (Section 14.3.1) (Figure 14.1d). 

The most challenging version of recognition is general category (or class) recognition 
(Section 14.4), which may involve recognizing instances of extremely varied classes such 
as animals or furniture. Some techniques rely purely on the presence of features (known 
as a “bag of words” model; see Section 14.4.1), others on their relative positions (part-based 
models, Section 14.4.2) (Figure 14.1e), while others involve segmenting the image into semantically 
meaningful regions (Section 14.4.3) (Figure 14.1f). In many instances, recognition depends 
heavily on the context of surrounding objects and scene elements (Section 14.5). Woven into 
all of these techniques is the topic of learning (Section 14.5.1), since hand-crafting specific 
object recognizers seems like a futile approach given the complexity of the problem. 

Given the extremely rich and complex nature of this topic, this chapter is structured to 
build from simpler concepts to more complex ones. We begin with a discussion of face and 
object detection (Section 14.1), where we introduce a number of machine-learning techniques 
such as boosting, neural networks, and support vector machines. Next, we study face recogni- 
tion (Section 14.2), which is one of the more widely known applications of recognition. This 
topic serves as an introduction to subspace (PCA) models and Bayesian approaches to recog- 
nition and classification. We then present techniques for instance recognition (Section 14.3), 
building upon earlier topics in this book, such as feature detection, matching, and geomet- 
ric alignment (Section 14.3.1). We introduce topics from the information and document re- 
trieval communities, such as frequency vectors, feature quantization, and inverted indices 
(Section 14.3.2). We also present applications of location recognition (Section 14.3.3). 

1 However, some recent research suggests that direct image matching may be feasible for large enough databases 
(Russell, Torralba, Liu et al. 2007; Malisiewicz and Efros 2008; Torralba, Freeman, and Fergus 2008). 

In the second half of the chapter, we address the most challenging variant of recognition, 
namely the problem of category recognition (Section 14.4). This includes approaches that use 
bags of features (Section 14.4.1), parts (Section 14.4.2), and segmentation (Section 14.4.3). 
We show how such techniques can be used to automate photo editing tasks, such as 3D mod- 
eling, scene completion, and creating collages (Section 14.4.4). Next, we discuss the role 
that context can play in both individual object recognition and more holistic scene under- 
standing (Section 14.5). We close this chapter with a discussion of databases and test sets for 
constructing and evaluating recognition systems (Section 14.6). 

While there is no comprehensive reference on object recognition, an excellent set of notes 
can be found in the ICCV 2009 short course (Fei-Fei, Fergus, and Torralba 2009), Antonio 
Torralba’s more comprehensive MIT course (Torralba 2008), and two recent collections of 
papers (Ponce, Hebert, Schmid et al. 2006; Dickinson, Leonardis, Schiele et al. 2007) and a 
survey on object categorization (Pinz 2005). An evaluation of some of the best performing 
recognition algorithms can be found on the PASCAL Visual Object Classes (VOC) Challenge 
Web site at http://pascallin.ecs.soton.ac.uk/challenges/VOC/. 


14.1 Object detection 

If we are given an image to analyze, such as the group portrait in Figure 14.2, we could try to 
apply a recognition algorithm to every possible sub-window in this image. Such algorithms 
are likely to be both slow and error-prone. Instead, it is more effective to construct special- 
purpose detectors, whose job it is to rapidly find likely regions where particular objects might 
occur. 

We begin this section with face detectors, which are some of the more successful examples 
of recognition. For example, such algorithms are built into most of today’s digital cameras to 
enhance auto-focus and into video conferencing systems to control pan-tilt heads. We then 
look at pedestrian detectors, as an example of more general methods for object detection. 
Such detectors can be used in automotive safety applications, e.g., detecting pedestrians and 
other cars from moving vehicles (Leibe, Cornelis, Cornelis et al. 2007). 


14.1.1 Face detection 

Before face recognition can be applied to a general image, the locations and sizes of any faces 
must first be found (Figures 14.1c and 14.2). In principle, we could apply a face recognition 
algorithm at every pixel and scale (Moghaddam and Pentland 1997) but such a process would 
be too slow in practice. 

Figure 14.2 Face detection results produced by Rowley, Baluja, and Kanade (1998a) © 
1998 IEEE. Can you find the one false positive (a box around a non-face) among the 57 true 
positive results? 


Over the years, a wide variety of fast face detection algorithms have been developed. 
Yang, Kriegman, and Ahuja (2002) provide a comprehensive survey of earlier work in this 
field; Yang’s ICPR 2004 tutorial 2 and the Torralba (2007) short course provide more recent 
reviews. 3 

According to the taxonomy of Yang, Kriegman, and Ahuja (2002), face detection tech- 
niques can be classified as feature-based, template-based, or appearance-based. Feature- 
based techniques attempt to find the locations of distinctive image features such as the eyes, 
nose, and mouth, and then verify whether these features are in a plausible geometrical ar- 
rangement. These techniques include some of the early approaches to face recognition (Fis- 
chler and Elschlager 1973; Kanade 1977; Yuille 1991), as well as more recent approaches 
based on modular eigenspaces (Moghaddam and Pentland 1997), local filter jets (Leung, 
Burl, and Perona 1995; Penev and Atick 1996; Wiskott, Fellous, Kruger et al. 1997), support 
vector machines (Heisele, Ho, Wu et al. 2003; Heisele, Serre, and Poggio 2007), and boosting 
(Schneiderman and Kanade 2004). 

Template-based approaches, such as active appearance models (AAMs) (Section 14.2.2), 
can deal with a wide range of pose and expression variability. Typically, they require good 
initialization near a real face and are therefore not suitable as fast face detectors. 

2 http://vision.ai.uiuc.edu/mhyang/face-detection-survey.html. 

3 An alternative approach to detecting faces is to look for regions of skin color in the image (Forsyth and Fleck 
1999; Jones and Rehg 2001). See Exercise 2.8 for some additional discussion and references. 

Figure 14.3 Pre-processing stages for face detector training (Rowley, Baluja, and Kanade 
1998a) © 1998 IEEE: (a) artificially mirroring, rotating, scaling, and translating training 
images for greater variability; (b) using images without faces (looking up at a tree) to generate 
non-face examples; (c) pre-processing the patches by subtracting a best fit linear function 
(constant gradient) and histogram equalizing. 


Appearance-based approaches scan over small overlapping rectangular patches of the im- 
age searching for likely face candidates, which can then be refined using a cascade of more 
expensive but selective detection algorithms (Sung and Poggio 1998; Rowley, Baluja, and 
Kanade 1998a; Romdhani, Torr, Scholkopf et al. 2001; Fleuret and Geman 2001; Viola and 
Jones 2004). In order to deal with scale variation, the image is usually converted into a 
sub-octave pyramid and a separate scan is performed on each level. Most appearance-based 
approaches today rely heavily on training classifiers using sets of labeled face and non-face 
patches. 
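
The scan over a sub-octave pyramid described above can be sketched as a generator of candidate windows. This is only an illustration: the 20-pixel window, three levels per octave, and the stride are assumed defaults rather than values taken from any particular detector, and the function names are hypothetical.

```python
def pyramid_scales(width, height, window=20, levels_per_octave=3):
    """Yield (scale, level_width, level_height) while the window still fits."""
    step = 2.0 ** (1.0 / levels_per_octave)    # sub-octave spacing between levels
    scale = 1.0
    while True:
        w, h = int(width / scale), int(height / scale)
        if w < window or h < window:
            break
        yield scale, w, h
        scale *= step


def scan_windows(width, height, window=20, stride=1, levels_per_octave=3):
    """Yield each candidate patch as (x, y, size) in original image coordinates."""
    for scale, w, h in pyramid_scales(width, height, window, levels_per_octave):
        for y in range(0, h - window + 1, stride):
            for x in range(0, w - window + 1, stride):
                yield x * scale, y * scale, window * scale
```

Each yielded window would be resampled to the fixed patch size and passed to the classifier; the number of windows grows quickly with image size, which is why the speed of the per-patch test matters so much.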

Sung and Poggio (1998) and Rowley, Baluja, and Kanade (1998a) present two of the ear- 
liest appearance-based face detectors and introduce a number of innovations that are widely 
used in later work by others. 

To start with, both systems collect a set of labeled face patches (Figure 14.2) as well as a 
set of patches taken from images that are known not to contain faces, such as aerial images or 
vegetation (Figure 14.3b). The collected face images are augmented by artificially mirroring, 
rotating, scaling, and translating the images by small amounts to make the face detectors less 
sensitive to such effects (Figure 14.3a). 

After an initial set of training images has been collected, some optional pre-processing 
can be performed, such as subtracting an average gradient (linear function) from the image 
to compensate for global shading effects and using histogram equalization to compensate for 
varying camera contrast (Figure 14.3c). 
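
The two pre-processing steps just described can be sketched as follows. This is a minimal illustration under stated assumptions: patches are lists of rows with at least two rows and columns, the linear function is fit by least squares over the whole patch (the original systems may restrict the fit to a mask), and the equalization is rank-based so that it also works on the real-valued output of the first step. The function names are hypothetical.

```python
import bisect


def subtract_linear_fit(patch):
    """Remove the least-squares plane a*x + b*y + c from a 2D patch."""
    h, w = len(patch), len(patch[0])
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    mean = sum(map(sum, patch)) / (h * w)
    # With centered coordinates on a full grid, x, y, and 1 are mutually
    # orthogonal, so each coefficient has a closed form.
    sxx = sum((x - cx) ** 2 for x in range(w)) * h
    syy = sum((y - cy) ** 2 for y in range(h)) * w
    a = sum((x - cx) * patch[y][x] for y in range(h) for x in range(w)) / sxx
    b = sum((y - cy) * patch[y][x] for y in range(h) for x in range(w)) / syy
    return [[patch[y][x] - (a * (x - cx) + b * (y - cy) + mean)
             for x in range(w)] for y in range(h)]


def equalize_histogram(patch, levels=256):
    """Map each value to (levels - 1) times its cumulative rank fraction."""
    flat = sorted(p for row in patch for p in row)
    n = len(flat)
    return [[(levels - 1) * bisect.bisect_right(flat, p) / n for p in row]
            for row in patch]
```

Calling `equalize_histogram(subtract_linear_fit(patch))` applies the two steps in the order given in the text.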

Figure 14.4 Learning a mixture of Gaussians model for face detection (Sung and Poggio 
1998) © 1998 IEEE. The face and non-face images (19²-long vectors) are first clustered into 
six separate clusters (each) using k-means and then analyzed using PCA. The cluster centers 
are shown in the right-hand columns. 


Clustering and PCA. Once the face and non-face patterns have been pre-processed, Sung 
and Poggio (1998) cluster each of these datasets into six separate clusters using k-means 
and then fit PCA subspaces to each of the resulting 12 clusters (Figure 14.4). At detection 
time, the DIFS and DFFS metrics first developed by Moghaddam and Pentland (1997) (see 
Figure 14.14 and (14.14)) are used to produce 24 Mahalanobis distance measurements (two 
per cluster). The resulting 24 measurements are input to a multi-layer perceptron (MLP), 
which is a neural network with alternating layers of weighted summations and sigmoidal non- 
linearities trained using the “backpropagation” algorithm (Rumelhart, Hinton, and Williams 
1986). 
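
To make the two per-cluster measurements concrete, the sketch below computes a distance in feature space (DIFS, a Mahalanobis distance inside the PCA subspace) and a distance from feature space (DFFS, the squared residual outside it) for one cluster. It assumes the cluster mean, orthonormal principal directions, and their eigenvalues have already been estimated; the exact normalization used by the cited papers may differ, and the function name is hypothetical.

```python
def difs_dffs(x, mean, eigvecs, eigvals):
    """Return (DIFS, DFFS) of vector x for one cluster.

    eigvecs is a list of orthonormal principal directions and eigvals the
    matching variances. Both are assumed to come from a prior PCA step.
    """
    d = [a - m for a, m in zip(x, mean)]
    coeffs = [sum(e_k * d_k for e_k, d_k in zip(e, d)) for e in eigvecs]
    difs = sum(c * c / lam for c, lam in zip(coeffs, eigvals))
    dffs = sum(v * v for v in d) - sum(c * c for c in coeffs)
    return difs, dffs
```

Evaluating this for each of the 12 clusters gives the 24 numbers that the text describes as the input to the multi-layer perceptron.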

Neural networks. Instead of first clustering the data and computing Mahalanobis distances 
to the cluster centers, Rowley, Baluja, and Kanade (1998a) apply a neural network (MLP) di- 
rectly to the 20 × 20 pixel patches of gray-level intensities, using a variety of differently sized 
hand-crafted “receptive fields” to capture both large-scale and smaller scale structure (Fig- 
ure 14.5). The resulting neural network directly outputs the likelihood of a face at the center 
of every overlapping patch in a multi-resolution pyramid. Since several overlapping patches 
(in both space and resolution) may fire near a face, an additional merging network is used 
to merge overlapping detections. The authors also experiment with training several networks 
and merging their outputs. Figure 14.2 shows a sample result from their face detector. 

Figure 14.5 A neural network for face detection (Rowley, Baluja, and Kanade 1998a) © 
1998 IEEE. Overlapping patches are extracted from different levels of a pyramid and then 
pre-processed as shown in Figure 14.3b. A three-layer neural network is then used to detect 
likely face locations. 

To make the detector run faster, a separate network operating on 30 × 30 patches is trained 
to detect both faces and faces shifted by ±5 pixels. This network is evaluated at every 10th 
pixel in the image (horizontally and vertically) and the results of this “coarse” or “sloppy” 
detector are used to select regions on which to run the slower single-pixel overlap technique. 
To deal with in-plane rotations of faces, Rowley, Baluja, and Kanade (1998b) train a router 
network to estimate likely rotation angles from input patches and then apply the estimated 
rotation to each patch before running the result through their upright face detector. 


Support vector machines. Instead of using a neural network to classify patches, Osuna, 
Freund, and Girosi (1997) use a support vector machine (SVM) (Hastie, Tibshirani, and 
Friedman 2001; Scholkopf and Smola 2002; Bishop 2006; Lampert 2008) to classify the same 
preprocessed patches as Sung and Poggio (1998). An SVM searches for a series of maximum 
margin separating planes in feature space between different classes (in this case, face and 
non-face patches). In those cases where linear classification boundaries are insufficient, the 
feature space can be lifted into higher-dimensional features using kernels (Hastie, Tibshirani, 
and Friedman 2001; Scholkopf and Smola 2002; Bishop 2006). SVMs have been used by 
other researchers for both face detection and face recognition (Heisele, Ho, Wu et al. 2003; 
Heisele, Serre, and Poggio 2007) and are a widely used tool in object recognition in general. 

Figure 14.6 Simple features used in boosting-based face detector (Viola and Jones 2004) 
© 2004 Springer: (a) difference of rectangle feature composed of 2–4 different rectangles 
(pixels inside the white rectangles are subtracted from the gray ones); (b) the first and second 
features selected by AdaBoost. The first feature measures the differences in intensity between 
the eyes and the cheeks, the second one between the eyes and the bridge of the nose. 


Boosting. Of all the face detectors currently in use, the one introduced by Viola and Jones 
(2004) is probably the best known and most widely used. Their technique was the first to 
introduce the concept of boosting to the computer vision community, which involves train- 
ing a series of increasingly discriminating simple classifiers and then blending their outputs 
(Hastie, Tibshirani, and Friedman 2001; Bishop 2006). 

In more detail, boosting involves constructing a classifier h(x) as a sum of simple weak 
learners, 

$$h(x) = \operatorname{sign}\left[ \sum_{j=0}^{m-1} \alpha_j h_j(x) \right], \tag{14.1}$$

where each of the weak learners h_j(x) is an extremely simple function of the input, and hence 
is not expected to contribute much (in isolation) to the classification performance. 

In most variants of boosting, the weak learners are threshold functions, 

$$h_j(x) = a_j [f_j < \theta_j] + b_j [f_j \geq \theta_j] = \begin{cases} a_j & \text{if } f_j < \theta_j \\ b_j & \text{otherwise,} \end{cases} \tag{14.2}$$

which are also known as decision stumps (basically, the simplest possible version of decision 
trees). In most cases, it is also traditional (and simpler) to set a_j and b_j to ±1, i.e., a_j = −s_j, 
b_j = +s_j, so that only the feature f_j, the threshold value θ_j, and the polarity of the threshold 
s_j ∈ ±1 need to be selected. 4 


4 Some variants, such as that of Viola and Jones (2004), use (a_j, b_j) ∈ [0, 1] and adjust the learning algorithm 

Figure 14.7 Schematic illustration of boosting, courtesy of Svetlana Lazebnik, after origi- 
nal illustrations from Paul Viola and David Lowe. After each weak classifier (decision stump 
or hyperplane) is selected, data points that are erroneously classified have their weights in- 
creased. The final classifier is a linear combination of the simple weak classifiers. 


In many applications of boosting, the features are simply coordinate axes x_k, i.e., the 
boosting algorithm selects one of the input vector components as the best one to threshold. In 
Viola and Jones’ face detector, the features are differences of rectangular regions in the input 
patch, as shown in Figure 14.6. The advantage of using these features is that, while they are 
more discriminating than single pixels, they are extremely fast to compute once a summed 
area table has been pre-computed, as described in Section 3.2.3 (3.31-3.32). Essentially, for 
the cost of an O(N) pre-computation phase (where N is the number of pixels in the image), 
subsequent differences of rectangles can be computed in 4r additions or subtractions, where 
r ∈ {2, 3, 4} is the number of rectangles in the feature. 
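
The constant-time rectangle sums behind these features can be sketched as follows. The table layout (one extra row and column of zeros) and the particular two-rectangle feature are illustrative choices rather than the exact features of the original detector, and the function names are hypothetical.

```python
def summed_area_table(image):
    """S[y][x] is the sum of image rows 0..y-1 and columns 0..x-1."""
    h, w = len(image), len(image[0])
    S = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += image[y][x]
            S[y + 1][x + 1] = S[y][x + 1] + row_sum
    return S


def rect_sum(S, x0, y0, x1, y1):
    """Sum of pixels with x0 <= x < x1 and y0 <= y < y1, using four lookups."""
    return S[y1][x1] - S[y0][x1] - S[y1][x0] + S[y0][x0]


def two_rect_feature(S, x, y, w, h):
    """Left half minus right half of a (2w x h) window anchored at (x, y)."""
    left = rect_sum(S, x, y, x + w, y + h)
    right = rect_sum(S, x + w, y, x + 2 * w, y + h)
    return left - right
```

The table is built once per image in a single pass; after that, every feature at every position and size costs the same handful of lookups, independent of the rectangle area.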

The key to the success of boosting is the method for incrementally selecting the weak 
learners and for re-weighting the training examples after each stage (Figure 14.7). The Ad- 
aBoost (Adaptive Boosting) algorithm (Hastie, Tibshirani, and Friedman 2001; Bishop 2006) 
does this by re-weighting each sample as a function of whether it is correctly classified at each 
stage, and using the stage-wise average classification error to determine the final weightings 
α_j among the weak classifiers, as described in Algorithm 14.1. While the resulting classi- 
fier is extremely fast in practice, the training time can be quite slow (on the order of weeks), 
because of the large number of feature (difference of rectangle) hypotheses that need to be 
examined at each stage. 
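
The re-weighting scheme can be sketched in a few lines, following the steps of Algorithm 14.1 with ±1 decision stumps on raw feature vectors. This is a simplified illustration: the stump search is brute force over all features, thresholds, and polarities (not the linear-time method mentioned in the algorithm), the error is clamped to keep the weights finite, and all function names are hypothetical.

```python
import math


def stump_predict(x, feature, theta, s):
    """Decision stump with a = -s and b = +s: -s if x[feature] < theta, else +s."""
    return -s if x[feature] < theta else s


def best_stump(X, y, w):
    """Brute-force search for the stump with the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        values = sorted(set(x[feature] for x in X))
        # One threshold below all samples (a constant stump) plus the midpoints.
        thresholds = [values[0] - 1.0] + [(a + b) / 2.0
                                          for a, b in zip(values, values[1:])]
        for theta in thresholds:
            for s in (+1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(xi, feature, theta, s) != yi)
                if best is None or err < best[0]:
                    best = (err, feature, theta, s)
    return best


def train_adaboost(X, y, n_stages):
    """Return a list of (alpha, feature, theta, s) tuples; labels y are +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n
    classifier = []
    for _ in range(n_stages):
        total = sum(w)
        w = [wi / total for wi in w]                      # renormalize
        err, feature, theta, s = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1.0 - 1e-10)           # keep beta, alpha finite
        beta = err / (1.0 - err)
        alpha = -math.log(beta)
        classifier.append((alpha, feature, theta, s))
        # Downweight the samples that this stump classified correctly.
        w = [wi * (beta if stump_predict(xi, feature, theta, s) == yi else 1.0)
             for xi, yi, wi in zip(X, y, w)]
    return classifier


def predict(classifier, x):
    total = sum(a * stump_predict(x, f, t, s) for a, f, t, s in classifier)
    return 1 if total >= 0 else -1
```

On a toy AND problem, where no single stump separates the classes, three stages are enough for the weighted vote to classify all four points correctly.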

To further increase the speed of the detector, it is possible to create a cascade of classifiers, 
where each classifier uses a small number of tests (say, a two-term AdaBoost classifier) to 
reject a large fraction of non-faces while trying to pass through all potential face candidates 
(Fleuret and Geman 2001; Viola and Jones 2004). An even faster algorithm for performing 
cascade learning has recently been developed by Brubaker, Wu, Sun et al. (2008). 
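
The control flow of such a cascade is simple enough to sketch directly. Here each stage is assumed to be a pair of a scoring function and an acceptance threshold, which is an illustrative interface and not the one used by the cited systems; the function name is hypothetical.

```python
def cascade_accepts(stages, patch):
    """Return (accepted, stages_evaluated) for one patch.

    stages is a list of (score_function, threshold) pairs, ordered from the
    cheapest test to the most expensive one.
    """
    for n, (score, threshold) in enumerate(stages, start=1):
        if score(patch) < threshold:
            return False, n          # rejected early; later stages never run
    return True, len(stages)
```

Since most patches in an image are not faces, most calls return after the first one or two cheap stages, which is where the speed-up comes from; the thresholds are set low enough that true faces are very rarely rejected early.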


accordingly. 

1. Input the positive and negative training examples along with their labels {(x_i, y_i)}, 
where y_i = 1 for positive (face) examples and y_i = −1 for negative examples. 

2. Initialize all the weights to w_{i,1} ← 1/N, where N is the number of training exam- 
ples. (Viola and Jones (2004) use a separate N_1 and N_2 for positive and negative 
examples.) 

3. For each training stage j = 1 ... M: 

(a) Renormalize the weights so that they sum up to 1 (divide them by their sum). 

(b) Select the best classifier h_j(x; f_j, θ_j, s_j) by finding the one that minimizes 
the weighted classification error 

$$e_j = \sum_{i=0}^{N-1} w_{i,j} e_{i,j}, \tag{14.3}$$

$$e_{i,j} = 1 - \delta(y_i, h_j(x_i; f_j, \theta_j, s_j)). \tag{14.4}$$

For any given f_j function, the optimal values of (θ_j, s_j) can be found in 
linear time using a variant of weighted median computation (Exercise 14.2). 

(c) Compute the modified error rate β_j and classifier weight α_j, 

$$\beta_j = \frac{e_j}{1 - e_j} \quad \text{and} \quad \alpha_j = -\log \beta_j. \tag{14.5}$$

(d) Update the weights according to the classification errors e_{i,j}, 

$$w_{i,j+1} \leftarrow w_{i,j} \beta_j^{1 - e_{i,j}}, \tag{14.6}$$

i.e., downweight the training samples that were correctly classified in pro- 
portion to the overall classification error. 

4. Set the final classifier to 

$$h(x) = \operatorname{sign}\left[ \sum_{j=0}^{m-1} \alpha_j h_j(x) \right]. \tag{14.7}$$

Algorithm 14.1 The AdaBoost training algorithm, adapted from Hastie, Tibshirani, and 
Friedman (2001), Viola and Jones (2004), and Bishop (2006). 

Figure 14.8 Pedestrian detection using histograms of oriented gradients (Dalal and Triggs 
2005) © 2005 IEEE: (a) the average gradient image over the training examples; (b) each 
“pixel” shows the maximum positive SVM weight in the block centered on the pixel; (c) like- 
wise, for the negative SVM weights; (d) a test image; (e) the computed R-HOG (rectangular 
histogram of gradients) descriptor; (f) the R-HOG descriptor weighted by the positive SVM 
weights; (g) the R-HOG descriptor weighted by the negative SVM weights. 


14.1.2 Pedestrian detection 

While a lot of the research on object detection has focused on faces, the detection of other 
objects, such as pedestrians and cars, has also received widespread attention (Gavrila and 
Philomin 1999; Gavrila 1999; Papageorgiou and Poggio 2000; Mohan, Papageorgiou, and 
Poggio 2001; Schneiderman and Kanade 2004). Some of these techniques maintain the same 
focus as face detection on speed and efficiency. Others, however, focus instead on accuracy, 
viewing detection as a more challenging variant of generic class recognition (Section 14.4) 
in which the locations and extents of objects are to be determined as accurately as possible. 
(See, for example, the PASCAL VOC detection challenge, 
http://pascallin.ecs.soton.ac.uk/challenges/VOC/.) 

An example of a well-known pedestrian detector is the algorithm developed by Dalal 
and Triggs (2005), who use a set of overlapping histogram of oriented gradients (HOG) de- 
scriptors fed into a support vector machine (Figure 14.8). Each HOG has cells to accumulate 
magnitude-weighted votes for gradients at particular orientations, just as in the scale invariant 
feature transform (SIFT) developed by Lowe (2004), which we discussed in Section 4.1.2 and 
Figure 4.18. Unlike SIFT, however, which is only evaluated at interest point locations, HOGs 
are evaluated on a regular overlapping grid and their descriptor magnitudes are normalized 
using an even coarser grid; they are only computed at a single scale and a fixed orientation. In 
order to capture the subtle variations in orientation around a person’s outline, a large number 
of orientation bins is used and no smoothing is performed in the central difference gradi- 
ent computation; see the work of Dalal and Triggs (2005) for more implementation details. 
Figure 14.8d shows a sample input image, while Figure 14.8e shows the associated HOG 
descriptors. 
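
The core of a HOG cell, magnitude-weighted orientation votes from unsmoothed central differences, can be sketched as follows. This is a simplified illustration: each pixel votes for a single bin (no interpolation between bins), orientations are treated as unsigned, the default of nine bins is an assumption of this sketch, and block normalization is reduced to a plain L2 normalization. The function names are hypothetical.

```python
import math


def hog_cell_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations."""
    h, w = len(cell), len(cell[0])
    hist = [0.0] * n_bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]   # central differences,
            gy = cell[y + 1][x] - cell[y - 1][x]   # no smoothing
            mag = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % math.pi   # unsigned, in [0, pi)
            b = min(int(angle / math.pi * n_bins), n_bins - 1)
            hist[b] += mag
    return hist


def l2_normalize(v, eps=1e-6):
    """Normalize a (block of concatenated) histogram(s) to roughly unit length."""
    norm = math.sqrt(sum(x * x for x in v) + eps * eps)
    return [x / norm for x in v]
```

A full descriptor would concatenate the histograms of neighboring cells into overlapping blocks, normalize each block, and stack the blocks over the detection window before handing the result to the SVM.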

Once the descriptors have been computed, a support vector machine (SVM) is trained 
on the resulting high-dimensional continuous descriptor vectors. Figures 14.8b-c show a 
diagram of the (most) positive and negative SVM weights in each block, while Figures 14.8f- 
g show the corresponding weighted HOG responses for the central input image. As you can 
see, there are a fair number of positive responses around the head, torso, and feet of the 
person, and relatively few negative responses (mainly around the middle and the neck of the 
sweater). 

The fields of pedestrian and general object detection have continued to evolve rapidly 
over the last decade (Belongie, Malik, and Puzicha 2002; Mikolajczyk, Schmid, and Zis- 
serman 2004; Leibe, Seemann, and Schiele 2005; Opelt, Pinz, and Zisserman 2006; Tor- 
ralba 2007; Andriluka, Roth, and Schiele 2009, 2010; Dollar, Belongie, and Perona 2010). 
Munder and Gavrila (2006) compare a number of pedestrian detectors and conclude that 
those based on local receptive fields and SVMs perform the best, with a boosting-based ap- 
proach coming close. Maji, Berg, and Malik (2008) improve on the best of these results using 
non-overlapping multi-resolution HOG descriptors and a histogram intersection kernel SVM 
based on a spatial pyramid match kernel from Lazebnik, Schmid, and Ponce (2006). 

When detectors for several different classes are being constructed simultaneously, Tor- 
ralba, Murphy, and Freeman (2007) show that sharing features and weak learners between 
detectors yields better performance, both in terms of faster computation times and fewer 
training examples. To find the features and decision stumps that work best in a shared man- 
ner, they introduce a novel joint boosting algorithm that optimizes, at each stage, a summed 
expected exponential loss function using the “gentleboost” algorithm of Friedman, Hastie, 
and Tibshirani (2000). 

In more recent work, Felzenszwalb, McAllester, and Ramanan (2008) extend the his- 
togram of oriented gradients person detector to incorporate flexible parts models (Section 14.4.2). 
Each part is trained and detected on HOGs evaluated at two pyramid levels below the overall 
object model and the locations of the parts relative to the parent node (the overall bounding 
box) are also learned and used during recognition (Figure 14.9b). To compensate for inac- 
curacies or inconsistencies in the training example bounding boxes (dashed white lines in 
Figure 14.9c), the “true” location of the parent (blue) bounding box is considered a latent 
(hidden) variable and is inferred during both training and recognition. Since the locations 
of the parts are also latent, the system can be trained in a semi-supervised fashion, without 
needing part labels in the training data. An extension to this system (Felzenszwalb, Girshick, 
McAllester et al. 2010), which includes among its improvements a simple contextual model, 
was among the two best object detection systems in the 2008 Visual Object Classes detection 
challenge. Other recent improvements to part-based person detection and pose estimation in- 
clude the work by Andriluka, Roth, and Schiele (2009) and Kumar, Zisserman, and Torr 
(2009). 

Figure 14.9 Part-based object detection (Felzenszwalb, McAllester, and Ramanan 2008) 
© 2008 IEEE: (a) An input photograph and its associated person (blue) and part (yellow) 
detection results. (b) The detection model is defined by a coarse template, several higher 
resolution part templates, and a spatial model for the location of each part. (c) True positive 
detection of a skier and (d) false positive detection of a cow (labeled as a person). 

An even more accurate estimate of a person’s pose and location is presented by Rogez, 
Rihan, Ramalingam et al. (2008), who compute both the phase of a person in a walk cycle and 
the locations of individual joints, using random forests built on top of HOGs (Figure 14.11). 
Since their system produces full 3D pose information, it is closer in its application domain to 
3D person trackers (Sidenbladh, Black, and Fleet 2000; Andriluka, Roth, and Schiele 2010), 
which we discussed in Section 12.6.4. 

One final note on person and object detection. When video sequences are available, the 
additional information present in the optic flow and motion discontinuities can greatly aid in 
the detection task, as discussed by Efros, Berg, Mori et al. (2003), Viola, Jones, and Snow 
(2003), and Dalal, Triggs, and Schmid (2006). 

14.2 Face recognition 

Among the various recognition tasks that computers might be asked to perform, face recog- 
nition is the one where they have arguably had the most success. 5 While computers cannot 
pick out suspects from thousands of people streaming in front of video cameras (even people 
cannot readily distinguish between similar people with whom they are not familiar (O’Toole, 
Jiang, Roark el al. 2006; O’Toole, Phillips, Jiang el al. 2009)), their ability to distinguish 

5 Instance recognition, i.e., the re-recognition of known objects such as locations or planar objects, is the other 
most successful application of general image recognition. In the general domain of biometrics, i.e., identity recogni- 
tion, specialized images such as irises and fingerprints perform even better (Jain, Bolle, and Pankanti 1999; Pankanti, 
Bolle, and Jain 2000; Daugman 2004). 
Figure 14.10 Part-based object detection results for people, bicycles, and horses (Felzenszwalb,
McAllester, and Ramanan 2008) © 2008 IEEE. The first three columns show correct
detections, while the rightmost column shows false positives.



Figure 14.11 Pose detection using random forests (Rogez, Rihan, Ramalingam et al. 2008) 
© 2008 IEEE. The estimated pose (state of the kinematic model) is drawn over each input 
frame. 
Figure 14.12 Humans can recognize low-resolution faces of familiar people (Sinha, Balas,
Ostrovsky et al. 2006) © 2006 IEEE. 


among a small number of family members and friends has found its way into consumer-level 
photo applications, such as Picasa and iPhoto. Face recognition can also be used in a variety 
of additional applications, including human-computer interaction (HCI), identity verification 
(Kirovski, Jojic, and Jancke 2004), desktop login, parental controls, and patient monitoring 
(Zhao, Chellappa, Phillips et al. 2003). 

Today’s face recognizers work best when they are given full frontal images of faces under 
relatively uniform illumination conditions, although databases that include large amounts 
of pose and lighting variation have been collected (Phillips, Moon, Rizvi et al. 2000; Sim, 
Baker, and Bsat 2003; Gross, Shi, and Cohn 2005; Huang, Ramesh, Berg et al. 2007; Phillips, 
Scruggs, O’Toole et al. 2010). (See Table 14.1 in Section 14.6 for more details.) 

Some of the earliest approaches to face recognition involved finding the locations of 
distinctive image features, such as the eyes, nose, and mouth, and measuring the distances 
between these feature locations (Fischler and Elschlager 1973; Kanade 1977; Yuille 1991). 
More recent approaches rely on comparing gray-level images projected onto lower dimen- 
sional subspaces called eigenfaces (Section 14.2.1) and jointly modeling shape and appear- 
ance variations (while discounting pose variations) using active appearance models (Sec- 
tion 14.2.2). 

Descriptions of additional face recognition techniques can be found in a number of sur- 
veys and books on this topic (Chellappa, Wilson, and Sirohey 1995; Zhao, Chellappa, Phillips 
et al. 2003; Li and Jain 2005) as well as the Face Recognition Web site. 6 The survey on face 
recognition by humans by Sinha, Balas, Ostrovsky et al. (2006) is also well worth reading; it 
includes a number of surprising results, such as humans’ ability to recognize low-resolution 
images of familiar faces (Figure 14.12) and the importance of eyebrows in recognition. 


6 http://www.face-rec.org/.
Figure 14.13 Face modeling and compression using eigenfaces (Moghaddam and Pentland
1997) © 1997 IEEE: (a) input image; (b) the first eight eigenfaces; (c) image reconstructed 
by projecting onto this basis and compressing the image to 85 bytes; (d) image reconstructed 
using JPEG (530 bytes). 


14.2.1 Eigenfaces 

Eigenfaces rely on the observation first made by Kirby and Sirovich (1990) that an arbitrary
face image $x$ can be compressed and reconstructed by starting with a mean image $m$ (Figure 14.1b)
and adding a small number of scaled signed images $u_i$, 7

\[ \tilde{x} = m + \sum_{i=0}^{M-1} a_i u_i, \tag{14.8} \]

where the signed basis images (Figure 14.13b) can be derived from an ensemble of training
images using principal component analysis (also known as eigenvalue analysis or the
Karhunen-Loève transform). Turk and Pentland (1991a) recognized that the coefficients $a_i$
in the eigenface expansion could themselves be used to construct a fast image matching
algorithm.

In more detail, let us start with a collection of training images $\{x_j\}$, from which we can
compute the mean image $m$ and a scatter or covariance matrix

\[ C = \frac{1}{N} \sum_{j=0}^{N-1} (x_j - m)(x_j - m)^T. \tag{14.9} \]

We can apply the eigenvalue decomposition (A.6) to represent this matrix as

\[ C = U \Lambda U^T = \sum_{i=0}^{N-1} \lambda_i u_i u_i^T, \tag{14.10} \]

where the $\lambda_i$ are the eigenvalues of $C$ and the $u_i$ are the eigenvectors. For general
images, Kirby and Sirovich (1990) call these vectors eigenpictures; for faces, Turk and Pentland

7 In previous chapters, we used I to indicate images; in this chapter, we use the more abstract quantities x and u 
to indicate collections of pixels in an image turned into a vector. 
Figure 14.14 Projection onto the linear subspace spanned by the eigenface images (Moghaddam
and Pentland 1997) © 1997 IEEE. The distance from face space (DFFS) is the orthogonal
distance to the plane, while the distance in face space (DIFS) is the distance along the
plane from the mean image. Both distances can be turned into Mahalanobis distances and
given probabilistic interpretations.


(1991a) call them eigenfaces (Figure 14.13b). 8 
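
As a rough illustration of (14.9-14.10), the following NumPy sketch computes a mean image and the leading eigenfaces. It assumes that the training images are stacked as the rows of an array `X`, and it works with the smaller N × N matrix of inner products between the deviations instead of the full P × P scatter matrix. The function name `compute_eigenfaces` and the random stand-in data are illustrative assumptions, not part of the original description.

```python
import numpy as np

def compute_eigenfaces(X, M):
    """Return the mean image, M unit-norm eigenfaces, and their eigenvalues.

    X is an (N, P) array holding one vectorized training image per row
    (an assumed layout). M should be smaller than N so that the retained
    eigenvalues are non-zero.
    """
    N = X.shape[0]
    m = X.mean(axis=0)                      # mean image
    D = X - m                               # signed deviations (x_j - m)
    G = D @ D.T / N                         # N x N inner-product matrix
    lam, V = np.linalg.eigh(G)              # eigenvalues in ascending order
    order = np.argsort(lam)[::-1][:M]       # keep the M largest
    lam, V = lam[order], V[:, order]
    U = D.T @ V                             # lift eigenvectors to pixel space
    U /= np.linalg.norm(U, axis=0)          # normalize each eigenface
    return m, U, lam

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 400))              # synthetic stand-in for 20 images
m, U, lam = compute_eigenfaces(X, M=5)
```

The columns of `U` are orthonormal and satisfy C u_i = λ_i u_i for the full scatter matrix, even though that matrix is never formed inside the function.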

Two important properties of the eigenvalue decomposition are that the optimal (best
approximation) coefficients $a_i$ for any new image $x$ can be computed as

\[ a_i = (x - m) \cdot u_i, \tag{14.11} \]

and that, assuming the eigenvalues $\{\lambda_i\}$ are sorted in decreasing order, truncating the
approximation given in (14.8) at any point $M$ gives the best possible approximation (least
error) between $\tilde{x}$ and $x$. Figure 14.13c shows the resulting approximation corresponding to
Figure 14.13a and shows how much better it is at compressing a face image than JPEG.
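
The projection (14.11) and the truncated reconstruction (14.8) can be sketched in a few lines. The example below is a minimal illustration that assumes an orthonormal basis obtained from an SVD of synthetic data; the helper names `project` and `reconstruct` are hypothetical. It records the reconstruction error of a new image as more components are kept, which can only shrink because the subspaces are nested.

```python
import numpy as np

def project(x, m, U):
    """Eigenface coefficients a_i = (x - m) . u_i, as in (14.11)."""
    return U.T @ (x - m)

def reconstruct(a, m, U):
    """Approximation m + sum_i a_i u_i, as in (14.8)."""
    return m + U @ a

rng = np.random.default_rng(1)
X = rng.normal(size=(12, 64))               # synthetic stand-in for training images
m = X.mean(axis=0)
# Rows of Vt are unit-norm eigenvectors of the scatter matrix, sorted by
# decreasing eigenvalue.
_, _, Vt = np.linalg.svd(X - m, full_matrices=False)
U = Vt.T

x = rng.normal(size=64)                     # a new image, not in the training set
errors = []
for M in range(1, U.shape[1] + 1):
    U_M = U[:, :M]
    x_tilde = reconstruct(project(x, m, U_M), m, U_M)
    errors.append(np.linalg.norm(x - x_tilde))
```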

Truncating the eigenface decomposition of a face image (14.8) after $M$ components is
equivalent to projecting the image onto a linear subspace $F$, which we can call the face space
(Figure 14.14). Because the eigenvectors (eigenfaces) are orthogonal and of unit norm, the
distance of a projected face $\tilde{x}$ to the mean face $m$ can be written as

\[ \mathrm{DIFS} = \|\tilde{x} - m\| = \sqrt{\sum_{i=0}^{M-1} a_i^2}, \tag{14.12} \]

where DIFS stands for distance in face space (Moghaddam and Pentland 1997). The remaining
distance between the original image $x$ and its projection onto face space $\tilde{x}$, i.e., the

8 In actual practice, the full $P \times P$ scatter matrix (14.9) is never computed. Instead, a smaller
$N \times N$ matrix consisting of the inner products between all the signed deviations $(x_i - m)$ is
accumulated. See Appendix A.1.2 (A.13-A.14) for details.
distance from face space (DFFS), can be computed directly in pixel space and represents the
“faceness” of a particular image. 9 It is also possible to measure the distance between two
different faces in face space as

\[ \mathrm{DIFS}(x, y) = \|\tilde{x} - \tilde{y}\| = \sqrt{\sum_{i=0}^{M-1} (a_i - b_i)^2}, \tag{14.13} \]


where the $b_i = (y - m) \cdot u_i$ are the eigenface coefficients corresponding to $y$.
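
To make the two distances concrete, the sketch below computes the DIFS (14.12) and the DFFS for a single image. It assumes an orthonormal basis `U` (here a random stand-in produced by a QR factorization, not real eigenfaces), and the function name `difs_and_dffs` is hypothetical. Because the projection is orthogonal, the two squared distances add up to the squared distance of the image from the mean.

```python
import numpy as np

def difs_and_dffs(x, m, U):
    """Return (DIFS, DFFS) of image x for orthonormal eigenfaces U (columns)."""
    d = x - m
    a = U.T @ d                             # eigenface coefficients (14.11)
    difs = np.linalg.norm(a)                # distance in face space (14.12)
    dffs = np.linalg.norm(d - U @ a)        # residual orthogonal to face space
    return difs, dffs

rng = np.random.default_rng(2)
m = rng.normal(size=50)
U, _ = np.linalg.qr(rng.normal(size=(50, 4)))   # stand-in orthonormal basis
x = rng.normal(size=50)
difs, dffs = difs_and_dffs(x, m, U)
```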

Computing such distances in Euclidean vector space, however, does not exploit the additional
information that the eigenvalue decomposition of our covariance matrix (14.10) provides.
If we interpret the covariance matrix $C$ as the covariance of a multi-variate Gaussian
(Appendix B.1.1), 10 we can turn the DIFS into a log likelihood by computing the Mahalanobis
distance

\[ \mathrm{DIFS}' = \|\tilde{x} - m\|_{C^{-1}} = \sqrt{\sum_{i=0}^{M-1} a_i^2 / \lambda_i}. \tag{14.14} \]


Instead of measuring the squared distance along each principal component in face space $F$,
the Mahalanobis distance measures the ratio between the squared distance and the corresponding
variance $\sigma_i^2 = \lambda_i$ and then sums these squared ratios (per-component log-likelihoods).
An alternative way to implement this is to pre-scale each eigenvector by the inverse square
root of its corresponding eigenvalue,

\[ \hat{U} = U \Lambda^{-1/2}. \tag{14.15} \]


This whitening transformation then means that Euclidean distances in feature (face) space 
now correspond directly to log likelihoods (Moghaddam, Jebara, and Pentland 2000). (This 
same whitening approach can also be used in feature-based matching algorithms, as discussed 
in Section 4.1.3.) 
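
The equivalence between the Mahalanobis distance (14.14) and a Euclidean distance in the whitened basis (14.15) is easy to check numerically. The following sketch uses synthetic data with deliberately unequal variances; the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.2])
m = X.mean(axis=0)
C = (X - m).T @ (X - m) / X.shape[0]        # covariance matrix (14.9)
lam, U = np.linalg.eigh(C)
lam, U = lam[::-1], U[:, ::-1]              # sort by decreasing eigenvalue

M = 3
U_M, lam_M = U[:, :M], lam[:M]
U_hat = U_M / np.sqrt(lam_M)                # whitened basis (14.15), column-wise scaling

x = rng.normal(size=6)
a = U_M.T @ (x - m)
mahalanobis = np.sqrt(np.sum(a**2 / lam_M)) # (14.14)
whitened = np.linalg.norm(U_hat.T @ (x - m))
```

Pre-scaling the basis once means that every later comparison is a plain Euclidean norm, which is what makes standard nearest neighbor machinery applicable.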

If the distribution in eigenface space is very elongated, the Mahalanobis distance properly 
scales the components to come up with a sensible (probabilistic) distance from the mean. 
A similar analysis can be performed for computing a sensible difference from face space 
(DFFS) (Moghaddam and Pentland 1997) and the two terms can be combined to produce an 
estimate of the likelihood of being a true face, which can be useful in doing face detection 
(Section 14.1.1). More detailed explanations of probabilistic and Bayesian PCA can be found
in textbooks on statistical learning (Hastie, Tibshirani, and Friedman 2001; Bishop 2006), 
which also discuss techniques for selecting the optimum number of components M to use in 
modeling a distribution. 


9 This can be used to form a simple face detector, as mentioned in Section 14.1.1. 

10 The ellipse shown in Figure 14.14 denotes an equi-probability contour of this multi-variate Gaussian. 
Figure 14.15 Images from the Harvard database used by Belhumeur, Hespanha, and Kriegman
(1997) © 1997 IEEE. Note the wide range of illumination variation, which can be more
dramatic than inter-personal variations.


One of the biggest advantages of using eigenfaces is that they reduce the comparison
of a new face image $x$ to a prototype (training) face image $x_k$ (one of the colored xs in
Figure 14.14) from a $P$-dimensional difference in pixel space to an $M$-dimensional difference
in face space,

\[ \|\tilde{x} - \tilde{x}_k\| = \|a - a_k\|, \tag{14.16} \]

where $a = U^T (x - m)$ (14.11) involves computing a dot product between the signed
difference-from-mean image $(x - m)$ and each of the eigenfaces $u_i$. Once again, however,
this Euclidean distance ignores the fact that we have more information about face likelihoods 
available in the distribution of training images. 
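
A minimal sketch of matching in face space, under the assumption that the prototype coefficients are precomputed once and that a plain nearest neighbor rule is used (the text does not prescribe a specific search strategy). The data are synthetic and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
P, N, M = 100, 8, 5
X = rng.normal(size=(N, P))                 # synthetic prototype images x_k
m = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - m, full_matrices=False)
U = Vt[:M].T                                # first M eigenfaces as columns

A = (X - m) @ U                             # precomputed coefficients a_k, shape (N, M)

x = X[3] + 0.05 * rng.normal(size=P)        # noisy view of prototype 3
a = U.T @ (x - m)                           # one projection per query (14.11)
dists = np.linalg.norm(A - a, axis=1)       # M-dimensional comparisons (14.16)
best = int(np.argmin(dists))
```

Each comparison now costs M operations instead of P, and the distance between coefficient vectors equals the pixel-space distance between the two projected images.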

Consider the set of images of one person taken under a wide range of illuminations shown 
in Figure 14.15. As you can see, the intrapersonal variability within these images is much 
greater than the typical extrapersonal variability between any two people taken under the 
same illumination. Regular PCA analysis fails to distinguish between these two sources of 
variability and may, in fact, devote most of its principal components to modeling the intrap- 
ersonal variability. 

If we are going to approximate faces by a linear subspace, it is more useful to have a 
space that discriminates between different classes (people) and is less sensitive to within-class 
variations (Belhumeur, Hespanha, and Kriegman 1997). Consider the three classes shown as 
different colors in Figure 14.16. As you can see, the distributions within a class (indicated 
by the tilted colored axes) are elongated and tilted with respect to the main face space PCA, 
Figure 14.16 Simple example of Fisher linear discriminant analysis. The samples come
from three different classes, shown in different colors along with their principal axes, which
are scaled to $2\sigma_i$. (The intersections of the tilted axes are the class means $m_k$.) The dashed
line is the (dominant) Fisher linear discriminant direction and the dotted lines are the linear
discriminants between the classes. Note how the discriminant direction is a blend between
the principal directions of the between-class and within-class scatter matrices.


which is aligned with the black x and y axes. We can compute the total within-class scatter 
matrix as 

\[ S_W = \sum_{k=0}^{K-1} S_k = \sum_{k=0}^{K-1} \sum_{i \in C_k} (x_i - m_k)(x_i - m_k)^T, \tag{14.17} \]

where $m_k$ is the mean of class $k$ and $S_k$ is its within-class scatter matrix. 11 Similarly, we
can compute the between-class scatter as

\[ S_B = \sum_{k=0}^{K-1} N_k (m_k - m)(m_k - m)^T, \tag{14.18} \]

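where $N_k$ is the number of exemplars in class $k$ and $m$ is the overall mean. For the three
distributions shown in Figure 14.16, we have

\[ S_W = 3N \begin{bmatrix} 0.246 & 0.183 \\ 0.183 & 0.457 \end{bmatrix} \quad \text{and} \quad S_B = N \begin{bmatrix} 6.125 & 0 \\ 0 & 0.375 \end{bmatrix}, \tag{14.19} \]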


11 To be consistent with Belhumeur, Hespanha, and Kriegman (1997), we use $S_W$ and $S_B$ to denote the scatter
matrices, even though we use $C$ elsewhere (14.9).
where $N = N_k = 13$ is the number of samples in each class.

To compute the most discriminating direction, Fisher’s linear discriminant (FLD) (Belhumeur,
Hespanha, and Kriegman 1997; Hastie, Tibshirani, and Friedman 2001; Bishop
2006), which is also known as linear discriminant analysis (LDA), selects the direction $u$
that results in the largest ratio between the projected between-class and within-class variations

\[ u^* = \arg\max_u \frac{u^T S_B u}{u^T S_W u}, \tag{14.20} \]

which is equivalent to finding the eigenvector corresponding to the largest eigenvalue of the
generalized eigenvalue problem

\[ S_B u = \lambda S_W u \quad \text{or} \quad \lambda u = S_W^{-1} S_B u. \tag{14.21} \]


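For the problem shown in Figure 14.16,

\[ S_W^{-1} S_B = \begin{bmatrix} 11.796 & -0.289 \\ -4.715 & 0.3889 \end{bmatrix} \quad \text{and} \quad u = \begin{bmatrix} 0.926 \\ -0.379 \end{bmatrix}. \tag{14.22} \]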


As you can see, using this direction results in a better separation between the classes than 
using the dominant PCA direction, which is the horizontal axis. In their paper, Belhumeur, 
Hespanha, and Kriegman (1997) show that Fisherfaces significantly outperform the original 
eigenfaces algorithm, especially when faces have large amounts of illumination variation, as 
in Figure 14.15. 
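
The two-dimensional example in (14.19) and (14.22) can be checked with a short NumPy sketch. It takes the scatter matrices as printed (rounded to three decimals), so the computed entries agree with (14.22) only up to that rounding; the sign convention chosen for the eigenvector is an assumption, since eigenvectors are only defined up to sign.

```python
import numpy as np

N = 13                                       # samples per class, from the text
S_W = 3 * N * np.array([[0.246, 0.183],
                        [0.183, 0.457]])     # within-class scatter (14.19)
S_B = N * np.array([[6.125, 0.0],
                    [0.0, 0.375]])           # between-class scatter (14.19)

# Form S_W^{-1} S_B from (14.21) with a linear solve instead of an explicit inverse.
A = np.linalg.solve(S_W, S_B)
eigvals, eigvecs = np.linalg.eig(A)
k = int(np.argmax(eigvals.real))             # dominant discriminant direction
u = eigvecs[:, k].real
u = u / np.linalg.norm(u)
if u[0] < 0:                                 # resolve the sign ambiguity
    u = -u
```

Note that the common factor $N$ cancels in the direction `u`; only the relative scaling of the two scatter matrices affects the eigenvalues.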

An alternative for modeling within-class (intrapersonal) and between-class (extraper- 
sonal) variations is to model each distribution separately and then use Bayesian techniques 
to find the closest exemplar (Moghaddam, Jebara, and Pentland 2000). Instead of computing 
the mean for each class and then the within-class and between-class distributions, consider 
evaluating the difference images 

\[ \Delta_{ij} = x_i - x_j \tag{14.23} \]

between all pairs of training images $(x_i, x_j)$. The differences between pairs that are in the
same class (the same person) are used to estimate the intrapersonal covariance matrix $\Sigma_I$,
while differences between different people are used to estimate the extrapersonal covariance
$\Sigma_E$. 12 The principal components (eigenfaces) corresponding to these two classes are shown
in Figure 14.17.
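
One straightforward way to estimate the two covariances from labeled training images is sketched below. It assumes a small data set (the double loop visits every ordered pair) and synthetic inputs; the function name `difference_covariances` is hypothetical.

```python
import numpy as np

def difference_covariances(X, labels):
    """Estimate (Sigma_I, Sigma_E) from all ordered pairs of training images.

    X is an (N, P) array of vectorized images and labels[i] is the identity
    of image i (both assumed inputs).
    """
    intra, extra = [], []
    for i in range(len(X)):
        for j in range(len(X)):
            if i == j:
                continue
            delta = X[i] - X[j]              # difference image (14.23)
            (intra if labels[i] == labels[j] else extra).append(delta)
    intra, extra = np.array(intra), np.array(extra)
    # Both (i, j) and (j, i) are included, so the differences are zero mean
    # and each covariance reduces to an average of outer products.
    Sigma_I = intra.T @ intra / len(intra)
    Sigma_E = extra.T @ extra / len(extra)
    return Sigma_I, Sigma_E

rng = np.random.default_rng(5)
labels = np.repeat(np.arange(4), 3)          # 4 people, 3 images each
centers = 5.0 * rng.normal(size=(4, 10))
X = centers[labels] + 0.5 * rng.normal(size=(12, 10))
Sigma_I, Sigma_E = difference_covariances(X, labels)
```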

At recognition time, we can compute the difference $\Delta_i$ between a new face $x$ and a stored
training image $x_i$ and evaluate its intrapersonal likelihood as

\[ p_I(\Delta_i) = p_N(\Delta_i; \Sigma_I) = \frac{1}{|2\pi\Sigma_I|^{1/2}} \exp\left(-\tfrac{1}{2} \|\Delta_i\|^2_{\Sigma_I^{-1}}\right), \tag{14.24} \]

12 Note that the difference distributions are zero mean because for every $\Delta_{ij}$ there corresponds a negative $\Delta_{ji}$.
Figure 14.17 “Dual” eigenfaces (Moghaddam, Jebara, and Pentland 2000) © 2000 Elsevier:
(a) intrapersonal and (b) extrapersonal.


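where $p_N$ is a normal (Gaussian) distribution with covariance $\Sigma_I$ and

\[ |2\pi\Sigma_I|^{1/2} = (2\pi)^{M/2} \prod_{j=1}^{M} \lambda_j^{1/2} \tag{14.25} \]

is its volume. The Mahalanobis distance

\[ \|\Delta_i\|^2_{\Sigma_I^{-1}} = \Delta_i^T \Sigma_I^{-1} \Delta_i = \|a^I - a_i^I\|^2 \tag{14.26} \]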


can be computed more efficiently by first projecting the new image $x$ into the whitened
intrapersonal face space (14.15)

\[ a^I = \hat{U}_I^T x \tag{14.27} \]

and then computing a Euclidean distance to the training image vector $a_i^I$, which can be
precomputed offline. The extrapersonal likelihood $p_E(\Delta_i)$ can be computed in a similar fashion.
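
The following sketch evaluates the intrapersonal log-likelihood through the whitened projection. It assumes the standard Gaussian density (with the factor of 1/2 in the exponent), keeps all components of a small synthetic covariance so that the result can be compared against a direct evaluation, and uses illustrative variable names throughout.

```python
import numpy as np

rng = np.random.default_rng(6)
P = 5
B = rng.normal(size=(P, P))
Sigma_I = B @ B.T + np.eye(P)                # stand-in intrapersonal covariance
lam, U = np.linalg.eigh(Sigma_I)
U_hat = U / np.sqrt(lam)                     # whitened basis (14.15)

x_train = rng.normal(size=P)
x_new = rng.normal(size=P)
a_train = U_hat.T @ x_train                  # can be precomputed offline
a_new = U_hat.T @ x_new                      # computed at recognition time (14.27)

mahal_sq = np.sum((a_new - a_train) ** 2)    # squared Mahalanobis distance (14.26)
log_volume = 0.5 * (P * np.log(2 * np.pi) + np.sum(np.log(lam)))   # log of (14.25)
log_p_I = -0.5 * mahal_sq - log_volume       # log of the Gaussian likelihood
```

Working with log-likelihoods avoids numerical underflow, which becomes a practical concern once the number of retained components grows.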

Once the intrapersonal and extrapersonal likelihoods have been computed, we can compute
the Bayesian likelihood of a new image $x$ matching a training image $x_i$ as

\[ p(\Delta_i) = \frac{p_I(\Delta_i) l_I}{p_I(\Delta_i) l_I + p_E(\Delta_i) l_E}, \tag{14.28} \]

where $l_I$ and $l_E$ are the prior probabilities of two images being in the same or in different
classes (Moghaddam, Jebara, and Pentland 2000). A simpler approach, which does not re- 
quire the evaluation of extrapersonal probabilities, is to simply choose the training image with 
Figure 14.18 Modular eigenspace for face recognition (Moghaddam and Pentland 1997)
© 1997 IEEE: (a) By detecting separate features in the faces (eyes, nose, mouth), separate
eigenspaces can be estimated for each one. (b) The relative positions of each feature can be
detected at recognition time, thus allowing for more flexibility in viewpoint and expression.


the highest likelihood $p_I(\Delta_i)$. In this case, nearest neighbor search techniques in the space
spanned by the precomputed $\{a_i^I\}$ vectors could be used to speed up finding the best match. 13
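
The Bayesian combination (14.28) is a simple ratio, but evaluating it from raw likelihoods can underflow. The sketch below works from log-likelihoods instead; the function name `bayesian_match_probability` and the default equal priors are assumptions made for illustration.

```python
import numpy as np

def bayesian_match_probability(log_p_I, log_p_E, l_I=0.5, l_E=0.5):
    """Evaluate (14.28) from log-likelihoods, avoiding underflow."""
    s_I = log_p_I + np.log(l_I)
    s_E = log_p_E + np.log(l_E)
    top = max(s_I, s_E)                      # subtract the larger exponent
    w_I, w_E = np.exp(s_I - top), np.exp(s_E - top)
    return w_I / (w_I + w_E)

# Example: very small likelihoods that would underflow if exponentiated directly.
p_match = bayesian_match_probability(-2000.0, -2010.0)
```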

Another way to improve the performance of eigenface-based approaches is to break up 
the image into separate regions such as the eyes, nose, and mouth (Figure 14.18) and to match 
each of these modular eigenspaces independently (Moghaddam and Pentland 1997; Heisele, 
Ho, Wu et al. 2003; Heisele, Serre, and Poggio 2007). The advantage of such a modular 
approach is that it can tolerate a wider range of viewpoints, because each part can move 
relative to the others. It also supports a larger variety of combinations, e.g., we can model one 
person as having a narrow nose and bushy eyebrows, without requiring the eigenfaces to span 
all possible combinations of nose, mouth, and eyebrows. (If you remember the cardboard 
children’s books where you can select different top and bottom faces, or Mr. Potato Head, 
you get the idea.) 

Another approach to dealing with large variability in appearance is to create view-based 
(view-specific) eigenspaces, as shown in Figure 14.19 (Moghaddam and Pentland 1997). We 
can think of these view-based eigenspaces as local descriptors that select different axes de- 
pending on which part of the face space you are in. Note that such approaches, however, 
potentially require large amounts of training data, i.e., pictures of every person in every pos- 
sible pose or expression. This is in contrast to the shape and appearance models we study in 

13 Note that while the covariance matrices $\Sigma_I$ and $\Sigma_E$ are computed by looking at differences between all pairs of
images, the run-time evaluation selects the nearest image to determine the facial identity. Whether this is statistically 
correct is explored in Exercise 14.4. 
Figure 14.19 View-based eigenspace (Moghaddam and Pentland 1997) © 1997 IEEE: (a)
Comparison between a regular (parametric) eigenspace reconstruction (middle column) and 
a view-based eigenspace reconstruction (right column) corresponding to the input image (left 
column). The top row is from a training image, the bottom row is from the test set. (b) A 
schematic representation of the two approaches, showing how each view computes its own 
local basis representation. 


Section 14.2.2, which can learn deformations across all individuals. 

It is also possible to generalize the bilinear factorization implicit in PCA and SVD ap- 
proaches to multilinear (tensor) formulations that can model several interacting factors si- 
multaneously (Vasilescu and Terzopoulos 2007). These ideas are related to currently active 
topics in machine learning such as subspace learning (Cai, He, Hu et al. 2007), local distance
functions (Frome, Singer, Sha et al. 2007), and metric learning (Ramanan and Baker 2009).
Learning approaches play an increasingly important role in face recognition, e.g., in the work 
of Sivic, Everingham, and Zisserman (2009) and Guillaumin, Verbeek, and Schmid (2009). 


14.2.2 Active appearance and 3D shape models 

The need to use modular or view-based eigenspaces for face recognition is symptomatic of 
a more general observation, i.e., that facial appearance and identifiability depend as much 
on shape as they do on color or texture (which is what eigenfaces capture). Furthermore, 
when dealing with 3D head rotations, the pose of a person’s head should be discounted when 
performing recognition. 

In fact, the earliest face recognition systems, such as those by Fischler and Elschlager 
(1973), Kanade (1977), and Yuille (1991), found distinctive feature points on facial images 
and performed recognition on the basis of their relative positions or distances. Newer
techniques such as local feature analysis (Penev and Atick 1996) and elastic bunch graph
matching (Wiskott, Fellous, Kruger et al. 1997) combine local filter responses (jets) at distinctive
Figure 14.20 Manipulating facial appearance through shape and color (Rowland and Perrett
1995) © 1995 IEEE. By adding or subtracting gender-specific shape and color characteristics 
to (b) an input image, different amounts of gender variation can be induced. The amounts 
added (from the mean) are: (a) +50% (gender enhancement), (c) -50% (near “androgyny”), 
(d) -100% (gender switched), and (e) -150% (opposite gender attributes enhanced). 


feature locations together with shape models to perform recognition. 

A visually compelling example of why both shape and texture are important is the work 
of Rowland and Perrett (1995), who manually traced the contours of facial features and then 
used these contours to normalize (warp) each image to a canonical shape. After analyzing 
both the shape and color images for deviations from the mean, they were able to associate 
certain shape and color deformations with personal characteristics such as age and gender 
(Figure 14.20). Their work demonstrates that both shape and color have an important influ- 
ence on the perception of such characteristics. 

Around the same time, researchers in computer vision were beginning to use simultane- 
ous shape deformations and texture interpolation to model the variability in facial appearance 
caused by identity or expression (Beymer 1996; Vetter and Poggio 1997), developing tech- 
niques such as Active Shape Models (Lanitis, Taylor, and Cootes 1997), 3D Morphable Mod- 
els (Blanz and Vetter 1999), and Elastic Bunch Graph Matching (Wiskott, Fellous, Kruger et 
al. 1997). 14 

Of all these techniques, the active appearance models (AAMs) of Cootes, Edwards, and 
Taylor (2001) are among the most widely used for face recognition and tracking. Like other 
shape and texture models, an AAM models both the variation in the shape of an image s, 
which is normally encoded by the location of key feature points on the image (Figure 14.21b), 

14 We have already seen the application of PCA to 3D head and face modeling and animation in Section 12.6.3. 
Figure 14.21 Active Appearance Models (Cootes, Edwards, and Taylor 2001) © 2001
IEEE: (a) input image with registered feature points; (b) the feature points (shape vector
$s$); (c) the shape-free appearance image (texture vector $t$).


as well as the variation in texture t, which is normalized to a canonical shape before being 
analyzed (Figure 14.21c). 15 

Both shape and texture are represented as deviations from a mean shape $\bar{s}$ and texture $\bar{t}$,

\[ s = \bar{s} + U_s a \tag{14.29} \]

\[ t = \bar{t} + U_t a, \tag{14.30} \]

where the eigenvectors in $U_s$ and $U_t$ have been pre-scaled (whitened) so that unit vectors in
$a$ represent one standard deviation of variation observed in the training data. In addition to
these principal deformations, the shape parameters are transformed by a global similarity to 
match the location, size, and orientation of a given face. Similarly, the texture image contains 
a scale and offset to best match novel illumination conditions. 

As you can see, the same appearance parameters a in (14.29-14.30) simultaneously con- 
trol both the shape and texture deformations from the mean, which makes sense if we believe 
them to be correlated. Figure 14.22 shows how moving three standard deviations along each 
of the first four principal directions ends up changing several correlated factors in a person’s 
appearance, including expression, gender, age, and identity. 

In order to fit an active appearance model to a novel image, Cootes, Edwards, and Taylor 
(2001) pre-compute a set of “difference decomposition” images, using an approach related to 
other fast techniques for incremental tracking, such as those we discussed in Sections 4.1.4,
8.1.3, and 8.2 (Gleicher 1997; Hager and Belhumeur 1998), which often learn a discrimi- 
native mapping between matching errors and incremental displacements (Avidan 2001; Jurie 
and Dhome 2002; Liu, Chen, and Kumar 2003; Sclaroff and Isidoro 2003; Romdhani and 
Vetter 2003; Williams, Blake, and Cipolla 2003). 

15 When only the shape variation is being captured, such models are called active shape models (ASMs) (Cootes, 
Cooper, Taylor et al. 1995; Davies, Twining, and Taylor 2008). These were already discussed in Section 5.1.1 
(5.13-5.17). 
Figure 14.22 Principal modes of variation in active appearance models (Cootes, Edwards,
and Taylor 2001) © 2001 IEEE. The four images show the effects of simultaneously changing
the first four modes of variation in both shape and texture by $\pm\sigma$ from the mean. You can
clearly see how the shape of the face and the shading are simultaneously affected.


In more detail, Cootes, Edwards, and Taylor (2001) compute the derivatives of a set of 
training images with respect to each of the parameters in a using finite differences and then 
compute a set of displacement weight images 


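\[ W = \left[ \frac{\partial x^T}{\partial a} \frac{\partial x}{\partial a} \right]^{-1} \frac{\partial x^T}{\partial a}, \tag{14.31} \]

which can be multiplied by the current error residual to produce an update step in the
parameters, $\delta a = -W r$. Matthews and Baker (2004) use their inverse compositional method,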
which they first developed for parametric optical flow (8.64-8.65), to further speed up active 
appearance model fitting and tracking. Examples of AAMs being fitted to two input images 
are shown in Figure 14.23. 
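
The role of the displacement weight matrix in (14.31) can be illustrated on a toy problem. The sketch below assumes a purely linear appearance model with a random stand-in Jacobian `J` in place of derivatives estimated from training images, so a single update recovers the true parameters exactly; real AAM fitting is nonlinear and iterates this step.

```python
import numpy as np

rng = np.random.default_rng(7)
P, K = 40, 3
J = rng.normal(size=(P, K))                  # stand-in for dx/da (P pixels, K parameters)
W = np.linalg.solve(J.T @ J, J.T)            # displacement weights (14.31)

a_true = np.array([0.5, -1.0, 2.0])
a = np.zeros(K)                              # current parameter estimate
r = J @ a - J @ a_true                       # residual between model and target
a_updated = a - W @ r                        # update step, delta a = -W r
```

Since `W` depends only on the precomputed derivatives, it can be evaluated once offline, which is what makes each fitting iteration cheap.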

Although active appearance models are primarily designed to accurately capture the vari- 
ability in appearance and deformation that are characteristic of faces, they can be adapted to 
face recognition by computing an identity subspace that separates variation in identity from 
other sources of variability such as lighting, pose, and expression (Costen, Cootes, Edwards 
et al. 1999). The basic idea, which is modeled after similar work in eigenfaces (Belhumeur, 
Hespanha, and Kriegman 1997; Moghaddam, Jebara, and Pentland 2000), is to compute sep- 
arate statistics for intrapersonal and extrapersonal variation and then find discriminating di- 
rections in these subspaces. While AAMs have sometimes been used directly for recognition 
(Blanz and Vetter 2003), their main use in the context of recognition is to align faces into 
a canonical pose (Liang, Xiao, Wen et al. 2008) so that more traditional methods of face 
Figure 14.23 Multiresolution model fitting (search) in active appearance models (Cootes,
Edwards, and Taylor 2001) © 2001 IEEE. The columns show the initial model, the results 
after 3, 8, and 11 iterations, and the final convergence. The rightmost column shows the input 
image. 


recognition (Penev and Atick 1996; Wiskott, Fellous, Kruger et al. 1997; Ahonen, Hadid, 
and Pietikainen 2006; Zhao and Pietikainen 2007; Cao, Yin, Tang et al. 2010) can be used. 
AAMs (or, actually, their simpler version, Active Shape Models (ASMs)) can also be used to
align face images to perform automated morphing (Zanella and Fuentes 2004). 

Active appearance models continue to be an active research area, with enhancements to 
deal with illumination and viewpoint variation (Gross, Baker, Matthews et al. 2005) as well 
as occlusions (Gross, Matthews, and Baker 2006). One of the most significant extensions is 
to construct 3D models of shape (Matthews, Xiao, and Baker 2007), which are much better at 
capturing and explaining the full variability of facial appearance across wide changes in pose. 



Figure 14.24 Head tracking with 3D AAMs (Matthews, Xiao, and Baker 2007) © 2007 
Springer. Each image shows a video frame along with the estimated yaw, pitch, and roll
parameters and the fitted 3D deformable mesh. 
Figure 14.25 Person detection and re-recognition using a combined face, hair, and torso
model (Sivic, Zitnick, and Szeliski 2006) © 2006 Springer: (a) Using face detection alone,
several of the heads are missed, (b) The combined face and clothing model successfully 
re-finds all the people. 


Such models can be constructed either from monocular video sequences (Matthews, Xiao, 
and Baker 2007), as shown in Figure 14.24, or from multi-view video sequences (Ramnath,
Koterba, Xiao et al. 2008), which provide even greater reliability and accuracy in reconstruction
and tracking. (For a recent review of progress in head pose estimation, please see the
survey paper by Murphy-Chutorian and Trivedi (2009).) 


14.2.3 Application: Personal photo collections

In addition to digital cameras automatically finding faces to aid in auto-focusing and video 
cameras finding faces in video conferencing to center on the speaker (either mechanically 
or digitally), face detection has found its way into most consumer-level photo organization 
packages, such as iPhoto, Picasa, and Windows Live Photo Gallery. Finding faces and al- 
lowing users to tag them makes it easier to find photos of selected people at a later date or to 
automatically share them with friends. In fact, the ability to tag friends in photos is one of the 
more popular features on Facebook. 

Sometimes, however, faces can be hard to find and recognize, especially if they are small, 
Figure 14.26 Recognizing objects in a cluttered scene (Lowe 2004) © 2004 Springer. Two
of the training images in the database are shown on the left. They are matched to the cluttered 
scene in the middle using SIFT features, shown as small squares in the right image. The affine 
warp of each recognized database image onto the scene is shown as a larger parallelogram in 
the right image. 


turned away from the camera, or otherwise occluded. In such cases, combining face recog- 
nition with person detection and clothes recognition can be very effective, as illustrated in 
Figure 14.25 (Sivic, Zitnick, and Szeliski 2006). Combining person recognition with other 
kinds of context, such as location recognition (Section 14.3.3) or activity or event recognition, 
can also help boost performance (Lin, Kapoor, Hua et al. 2010). 


14.3 Instance recognition 

General object recognition falls into two broad categories, namely instance recognition and 
class recognition. The former involves re-recognizing a known 2D or 3D rigid object, poten- 
tially being viewed from a novel viewpoint, against a cluttered background, and with partial 
occlusions. The latter, which is also known as category-level or generic object recognition 
(Ponce, Hebert, Schmid et al. 2006), is the much more challenging problem of recognizing 
any instance of a particular general class such as “cat”, “car”, or “bicycle”. 

Over the years, many different algorithms have been developed for instance recognition. 
Mundy (2006) surveys earlier approaches, which focused on extracting lines, contours, or 
3D surfaces from images and matching them to known 3D object models. Another popu- 
lar approach was to acquire images from a large set of viewpoints and illuminations and to 
represent them using an eigenspace decomposition (Murase and Nayar 1995). More recent 
approaches (Lowe 2004; Rothganger, Lazebnik, Schmid et al. 2006; Ferrari, Tuytelaars, and 
Van Gool 2006b; Gordon and Lowe 2006; Obdrzalek and Matas 2006; Sivic and Zisserman 
2009) tend to use viewpoint-invariant 2D features, such as those we saw in Section 4.1.2. After
extracting informative sparse 2D features from both the new image and the images in the
database, image features are matched against the object database, using one of the sparse fea- 
ture matching strategies described in Section 4.1.3. Whenever a sufficient number of matches 
have been found, they are verified by finding a geometric transformation that aligns the two 
sets of features (Figure 14.26). 

Below, we describe some of the techniques that have been proposed for representing the 
geometric relationships between such features (Section 14.3.1). We also discuss how to make 
the feature matching process more efficient using ideas from text and information retrieval 
(Section 14.3.2). 


14.3.1 Geometric alignment 

To recognize one or more instances of some known objects, such as those shown in the left 
column of Figure 14.26, the recognition system first extracts a set of interest points in each 
database image and stores the associated descriptors (and original positions) in an indexing 
structure such as a search tree (Section 4.1.3). At recognition time, features are extracted 
from the new image and compared against the stored object features. Whenever a sufficient 
number of matching features (say, three or more) are found for a given object, the system then 
invokes a match verification stage, whose job is to determine whether the spatial arrangement 
of matching features is consistent with those in the database image. 

Because images can be highly cluttered and similar features may belong to several objects, 
the original set of feature matches can have a large number of outliers. For this reason, Lowe 
(2004) suggests using a Hough transform (Section 4.3.2) to accumulate votes for likely geo- 
metric transformations. In his system, he uses an affine transformation between the database 
object and the collection of scene features, which works well for objects that are mostly pla- 
nar, or where at least several corresponding features share a quasi-planar geometry. 16 

Since SIFT features carry with them their own location, scale, and orientation, Lowe uses 
a four-dimensional similarity transformation as the original Hough binning structure, i.e., 
each bin denotes a particular location for the object center, scale, and in-plane rotation. Each 
matching feature votes for the nearest $2^4$ bins and peaks in the transform are then selected for
a more careful affine motion fit. Figure 14.26 (right image) shows three instances of the two 
objects on the left that were recognized by the system. Obdrzalek and Matas (2006) general- 
ize Lowe’s approach to use feature descriptors with full local affine frames and evaluate their 
approach on a number of object recognition databases. 
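
The voting step described above can be sketched as follows. This is a minimal illustration, not Lowe's implementation: it assumes each feature is an `(x, y, scale, orientation)` tuple, casts a single vote per match rather than voting for the nearest $2^4$ bins, and uses bin widths (`loc_bin`, `rot_bin`, one octave in scale) that are illustrative assumptions.

```python
import math
from collections import defaultdict

def similarity_from_match(model, scene):
    """Similarity transform (tx, ty, log2 scale, rotation) implied by one match.

    Each feature is (x, y, scale, orientation_in_radians).
    """
    mx, my, ms, mo = model
    sx, sy, ss, so = scene
    s = ss / ms
    theta = (so - mo) % (2 * math.pi)
    c, sn = math.cos(theta), math.sin(theta)
    # Where the model origin lands in the scene under this transform.
    tx = sx - s * (c * mx - sn * my)
    ty = sy - s * (sn * mx + c * my)
    return tx, ty, math.log2(s), theta

def hough_vote(matches, loc_bin=32.0, rot_bin=math.radians(30)):
    """Accumulate one vote per match in a coarse 4D (tx, ty, scale, rotation) table."""
    n_rot = round(2 * math.pi / rot_bin)
    votes = defaultdict(list)
    for idx, (model, scene) in enumerate(matches):
        tx, ty, log_s, theta = similarity_from_match(model, scene)
        key = (math.floor(tx / loc_bin), math.floor(ty / loc_bin),
               round(log_s), math.floor(theta / rot_bin) % n_rot)
        votes[key].append(idx)
    return votes
```

Bins that collect three or more votes would then be passed to the affine fit; matches that land in sparsely populated bins are treated as outliers.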

16 When a larger number of features is available, a full fundamental matrix can be used (Brown and Lowe 2002;
Gordon and Lowe 2006). When image stitching is being performed (Brown and Lowe 2007), the motion models
discussed in Section 9.1 can be used instead.

Figure 14.27 3D object recognition with affine regions (Rothganger, Lazebnik, Schmid et
al. 2006) © 2006 Springer: (a) sample input image; (b) five of the recognized (reprojected)
objects along with their bounding boxes; (c) a few of the local affine regions; (d) local affine
region (patch) reprojected into a canonical (square) frame, along with its geometric affine
transformations.

Another system that uses local affine frames is the one developed by Rothganger, Lazebnik,
Schmid et al. (2006). In their system, the affine region detector of Mikolajczyk and
Schmid (2004) is used to rectify local image patches (Figure 14.27d), from which both a
SIFT descriptor and a 10 × 10 UV color histogram are computed and used for matching
and recognition. Corresponding patches in different views of the same object, along with
their local affine deformations, are used to compute a 3D affine model for the object using
an extension of the factorization algorithm of Section 7.3, which can then be upgraded to a
Euclidean reconstruction (Tomasi and Kanade 1992).

At recognition time, local Euclidean neighborhood constraints are used to filter potential 
matches, in a manner analogous to the affine geometric constraints used by Lowe (2004) and 
Obdrzalek and Matas (2006). Figure 14.27 shows the results of recognizing five objects in a 
cluttered scene using this approach. 

While feature-based approaches are normally used to detect and localize known objects in 
scenes, it is also possible to get pixel-level segmentations of the scene based on such matches. 
Ferrari, Tuytelaars, and Van Gool (2006b) describe such a system for simultaneously recog- 
nizing objects and segmenting scenes, while Kannala, Rahtu, Brandt et al. (2008) extend this 
approach to non-rigid deformations. Section 14.4.3 re-visits this topic of joint recognition 
and segmentation in the context of generic class (category) recognition. 

14.3.2 Large databases 

As the number of objects in the database starts to grow large (say, millions of objects or video 
frames being searched), the time it takes to match a new image against each database image 
can become prohibitive. Instead of comparing the images one at a time, techniques are needed 
to quickly narrow down the search to a few likely images, which can then be compared using 
a more detailed and conservative verification stage.

Figure 14.28 Visual words obtained from elliptical normalized affine regions (Sivic and
Zisserman 2009) © 2009 IEEE. (a) Affine covariant regions are extracted from each frame
and clustered into visual words using k-means clustering on SIFT descriptors with a learned
Mahalanobis distance. (b) The central patch in each grid shows the query and the surrounding
patches show the nearest neighbors.


The problem of quickly finding partial matches between documents is one of the cen- 
tral problems in information retrieval (IR) (Baeza-Yates and Ribeiro-Neto 1999; Manning, 
Raghavan, and Schütze 2008). The basic approach in fast document retrieval algorithms is to
pre-compute an inverted index between individual words and the documents (or Web pages
or news stories) where they occur. More precisely, the frequency of occurrence of particular
words in a document is used to quickly find documents that match a particular query. 

Sivic and Zisserman (2009) were the first to adapt IR techniques to visual search. In their 
Video Google system, affine invariant features are first detected in all the video frames they 
are indexing using both shape adapted regions around Harris feature points (Schaffalitzky 
and Zisserman 2002; Mikolajczyk and Schmid 2004) and maximally stable extremal regions 
(Matas, Chum, Urban et al. 2004), (Section 4.1.1), as shown in Figure 14.28a. Next, 128- 
dimensional SIFT descriptors are computed from each normalized region (i.e., the patches 
shown in Figure 14.28b). Then, an average covariance matrix for these descriptors is es- 
timated by accumulating statistics for features tracked from frame to frame. The feature 
descriptor covariance $\Sigma$ is then used to define a Mahalanobis distance between feature
descriptors,

$$d(x_0, x_1) = \|x_0 - x_1\|_{\Sigma^{-1}} = \sqrt{(x_0 - x_1)^T \Sigma^{-1} (x_0 - x_1)}. \tag{14.32}$$

In practice, feature descriptors are whitened by pre-multiplying them by $\Sigma^{-1/2}$ so that
Euclidean distances can be used. 17
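
The equivalence between whitening and the Mahalanobis distance can be checked numerically. The sketch below is an assumption-laden illustration: instead of the symmetric square root $\Sigma^{-1/2}$ it whitens with the inverse of the Cholesky factor $L$ (where $\Sigma = L L^T$), which yields the same distances because $L^{-T} L^{-1} = \Sigma^{-1}$. The function names are hypothetical.

```python
import math

def cholesky(a):
    """Lower-triangular L with L L^T = a (a must be symmetric positive definite)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def whiten(L, x):
    """Solve L y = x by forward substitution, i.e. y = L^{-1} x."""
    y = []
    for i in range(len(x)):
        y.append((x[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
```

After whitening every descriptor once, any standard Euclidean nearest-neighbor or k-means routine can be used unchanged.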

In order to apply fast information retrieval techniques to images, the high-dimensional 
feature descriptors that occur in each image must first be mapped into discrete visual words. 

17 Note that the computation of feature covariances from matched feature points is much more sensible than simply 
performing a PCA on the descriptor space (Winder and Brown 2007). This corresponds roughly to the within-class 
scatter matrix (14.17) we studied in Section 14.2.1.

Figure 14.29 Matching based on visual words (Sivic and Zisserman 2009) © 2009 IEEE.
(a) Features in the query region on the left are matched to corresponding features in a highly
ranked video frame. (b) Results after removing the stop words and filtering the results using
spatial consistency.


Sivic and Zisserman (2003) perform this mapping using k-means clustering, while some of
the newer methods discussed below (Nister and Stewenius 2006; Philbin, Chum, Isard et al.
2007) use alternative techniques, such as vocabulary trees or randomized forests. To keep the
clustering time manageable, only a few hundred video frames are used to learn the cluster
centers, which still involves estimating several thousand clusters from about 300,000 descriptors.
At visual query time, each feature in a new query region (e.g., Figure 14.28a, which is
a cropped region from a larger video frame) is mapped to its corresponding visual word. To 
keep very common patterns from contaminating the results, a stop list of the most common 
visual words is created and such words are dropped from further consideration. 
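
A minimal sketch of the quantization and stop-list steps, assuming the cluster centers have already been learned (for example by k-means). The stop-list `fraction` is a hypothetical parameter chosen for illustration, not a value taken from the text.

```python
from collections import Counter

def nearest_word(descriptor, centers):
    """Index of the closest cluster center (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda k: sum((d - c) ** 2
                                 for d, c in zip(descriptor, centers[k])))

def build_stop_list(word_lists, fraction=0.05):
    """Return the most frequent `fraction` of the words seen across all images."""
    counts = Counter(w for words in word_lists for w in words)
    n_stop = int(len(counts) * fraction)
    return {w for w, _ in counts.most_common(n_stop)}

def quantize(descriptors, centers, stop_list=frozenset()):
    """Map descriptors to visual words and drop any word on the stop list."""
    words = [nearest_word(d, centers) for d in descriptors]
    return [w for w in words if w not in stop_list]
```

The linear scan in `nearest_word` is the cost that vocabulary trees and randomized k-d forests are designed to reduce.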

Once a query image or region has been mapped into its constituent visual words, likely 
matching images or video frames must then be retrieved from the database. Information 
retrieval systems do this by matching word distributions (term frequencies) $n_{id}/n_d$ between
the query and target documents, where $n_{id}$ is how many times word $i$ occurs in document $d$,
and $n_d$ is the total number of words in document $d$. In order to downweight words that occur
frequently and to focus the search on rarer (and hence, more informative) terms, an inverse
document frequency weighting $\log N/N_i$ is applied, where $N_i$ is the number of documents
containing word $i$, and $N$ is the total number of documents in the database. The combination
of these two factors results in the term frequency-inverse document frequency (tf-idf) measure, 


$$t_i = \frac{n_{id}}{n_d} \log \frac{N}{N_i}. \tag{14.33}$$


At match time, each document (or query region) is represented by its tf-idf vector

$$\mathbf{t} = (t_1, \ldots, t_i, \ldots, t_m). \tag{14.34}$$
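
As a concrete reading of (14.33) and (14.34), the sketch below computes sparse tf-idf vectors for documents given as lists of visual-word indices. The dictionary representation and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

def idf_weights(docs):
    """log(N / N_i) for every word i, where docs is a list of visual-word lists."""
    n_docs = len(docs)
    doc_freq = Counter(w for words in docs for w in set(words))
    return {w: math.log(n_docs / n_i) for w, n_i in doc_freq.items()}

def tfidf(words, idf):
    """Sparse tf-idf vector {word: (n_id / n_d) * log(N / N_i)} as in (14.33)."""
    counts = Counter(words)
    n_d = len(words)
    return {w: (n / n_d) * idf[w] for w, n in counts.items() if w in idf}
```

Note that a word occurring in every document receives weight $\log 1 = 0$, so it contributes nothing to the comparison, which is the intended downweighting effect.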


The similarity between two documents is measured by the dot product between their corresponding
normalized vectors $\hat{\mathbf{t}} = \mathbf{t}/\|\mathbf{t}\|$, which means that their dissimilarity is proportional
to their Euclidean distance. In their journal paper, Sivic and Zisserman (2009) compare this

1. Vocabulary construction (off-line)

(a) Extract affine covariant regions from each database image. 

(b) Compute descriptors and optionally whiten them to make Euclidean dis- 
tances meaningful (Sivic and Zisserman 2009). 

(c) Cluster the descriptors into visual words, either using k-means (Sivic and 
Zisserman 2009), hierarchical clustering (Nister and Stewenius 2006), or 
randomized k-d trees (Philbin, Chum, Isard et al. 2007). 

(d) Decide which words are too common and put them in the stop list. 

2. Database construction (off-line) 

(a) Compute term frequencies for the visual words in each image, document frequencies
for each word, and normalized tf-idf vectors for each document.

(b) Compute inverted indices from visual words to images (with word counts). 

3. Image retrieval (on-line) 

(a) Extract regions, descriptors, and visual words, and compute a tf-idf vector 
for the query image or region. 

(b) Retrieve the top image candidates, either by exhaustively comparing sparse 
tf-idf vectors (Sivic and Zisserman 2009) or by using inverted indices to ex- 
amine only a subset of the images (Nister and Stewenius 2006). 

(c) Optionally re-rank or verify all the candidate matches, using either spatial 
consistency (Sivic and Zisserman 2009) or an affine (or simpler) transforma- 
tion model (Philbin, Chum, Isard et al. 2007). 

(d) Optionally expand the answer set by re-submitting highly ranked matches as 
new queries (Chum, Philbin, Sivic et al. 2007). 


Algorithm 14.2 Image retrieval using visual words (Sivic and Zisserman 2009; Nister and 
Stewenius 2006; Philbin, Chum, Isard et al. 2007; Chum, Philbin, Sivic et al. 2007; Philbin, 
Chum, Sivic et al. 2008).

simple metric to a dozen other metrics and conclude that it performs just about as well as
more complicated metrics. Because the number of non-zero $t_i$ terms in a typical query or
document is small ($M \approx 200$) compared to the number of visual words ($V \approx 20{,}000$), the
distance between pairs of (sparse) tf-idf vectors can be computed quite quickly.
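
One way to exploit this sparsity is an inverted index, as listed in steps 2(b) and 3(b) of Algorithm 14.2. The sketch below assumes the document vectors are already normalized and stored as `{word: weight}` dictionaries; the names and the `top_n` parameter are illustrative.

```python
from collections import defaultdict

def build_inverted_index(doc_vectors):
    """Map each word to the list of (doc_id, weight) pairs where it is non-zero."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(doc_vectors):
        for word, weight in vec.items():
            if weight != 0.0:
                index[word].append((doc_id, weight))
    return index

def score_query(query_vec, index, top_n=5):
    """Accumulate dot products by visiting only documents that share a query word."""
    scores = defaultdict(float)
    for word, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(word, ()):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]
```

Documents that share no visual word with the query never appear in `scores`, so the work per query scales with the lengths of the touched posting lists rather than with the database size.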

After retrieving the top $N_s = 500$ documents based on word frequencies, Sivic and Zisserman
(2009) re-rank these results using spatial consistency. This step involves taking every
matching feature and counting the number of $k = 15$ nearest adjacent features that also match
between the two documents. (This latter process is accelerated using inverted files, which we
discuss in more detail below.) As shown in Figure 14.29, this step helps remove spurious false
positive matches and produces a better estimate of which frames and regions in the video are
actually true matches. Algorithm 14.2 summarizes the processing steps involved in image
retrieval using visual words. 
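
The spatial consistency count can be sketched as below. This is one plausible reading of the step rather than the authors' code: a match is supported by another match when the other match's features lie among the $k$ spatially nearest features in both images. The brute-force neighbor search is for clarity only.

```python
def k_nearest(points, idx, k):
    """Indices of the k points closest to points[idx], excluding idx itself."""
    px, py = points[idx]
    others = [j for j in range(len(points)) if j != idx]
    others.sort(key=lambda j: (points[j][0] - px) ** 2 + (points[j][1] - py) ** 2)
    return set(others[:k])

def spatial_consistency_score(matches, query_pts, doc_pts, k=15):
    """Total number of neighboring matches that support each match.

    matches: list of (query_index, doc_index) pairs.
    """
    total = 0
    for q, d in matches:
        q_near = k_nearest(query_pts, q, k)
        d_near = k_nearest(doc_pts, d, k)
        total += sum(1 for q2, d2 in matches
                     if q2 in q_near and d2 in d_near)
    return total
```

Candidate frames can then be re-ranked by this score, so that frames whose matches are spatially scattered fall below frames whose matches move together.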

While this approach works well for tens of thousands of visual words and thousands of
keyframes, as the size of the database continues to increase, both the time to quantize each
feature and to find potential matching frames or images can become prohibitive. Nister and
Stewenius (2006) address this problem by constructing a hierarchical vocabulary tree, where
feature vectors are hierarchically clustered into a k-way tree of prototypes. (This technique is
also known as tree-structured vector quantization (Gersho and Gray 1991).) At both database
construction time and query time, each descriptor vector is compared to several prototypes
at a given level in the vocabulary tree and the branch with the closest prototype is selected
for further refinement (Figure 14.30). In this way, vocabularies with millions ($10^6$) of words
can be supported, which enables individual words to be far more discriminative, while only
requiring $10 \cdot 6$ comparisons for quantizing each descriptor.
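
The greedy descent through a vocabulary tree can be sketched as follows. The `Node` class is a hypothetical data structure, and the tree is assumed to have been built beforehand by hierarchical k-means. With a branching factor of 10 and six levels, the loop performs the 60 comparisons mentioned above while addressing $10^6$ leaves.

```python
class Node:
    def __init__(self, center, children=(), word_id=None):
        self.center = center          # prototype descriptor for this cluster
        self.children = list(children)
        self.word_id = word_id        # set only on leaves

def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def quantize(root, descriptor):
    """Descend the tree greedily; return (word_id, number_of_comparisons)."""
    node, comparisons = root, 0
    while node.children:
        comparisons += len(node.children)
        node = min(node.children, key=lambda c: sq_dist(c.center, descriptor))
    return node.word_id, comparisons
```

Because only one branch is followed at each level, a descriptor near a cluster boundary can end up in a different leaf from its true nearest prototype, which is the quantization error that scoring interior nodes is meant to soften.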

At query time, each node in the vocabulary tree keeps its own inverted file index, so that 
features that match a particular node in the tree can be rapidly mapped to potential matching 
images. (Interior nodes just use the inverted indices of their corresponding leaf-node
descendants.) To score a particular query tf-idf vector $\mathbf{t}_q$ against all document vectors $\{\mathbf{t}_j\}$
using an $L_p$ metric, 18 the non-zero $t_{iq}$ entries in $\mathbf{t}_q$ are used to fetch corresponding non-zero
$t_{ij}$ entries, and the $L_p$ norm is efficiently computed as

$$\|\mathbf{t}_q - \mathbf{t}_j\|_p^p = 2 + \sum_{i \mid t_{iq} > 0 \wedge t_{ij} > 0} \left( |t_{iq} - t_{ij}|^p - |t_{iq}|^p - |t_{ij}|^p \right). \tag{14.35}$$
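
The identity in (14.35) relies on both vectors having unit $L_p$ norm: entries that are non-zero in only one vector then contribute $1 - \sum |t_{iq}|^p$ and $1 - \sum |t_{ij}|^p$ (sums over the shared words), which accounts for the constant 2. A small numerical check, with sparse vectors stored as dictionaries (an assumed representation):

```python
def lp_normalize(vec, p):
    """Scale a sparse vector so that its L_p norm equals one."""
    norm = sum(abs(v) ** p for v in vec.values()) ** (1.0 / p)
    return {w: v / norm for w, v in vec.items()}

def lp_distance_full(q, d, p):
    """Direct evaluation of ||q - d||_p^p over the union of non-zero entries."""
    words = set(q) | set(d)
    return sum(abs(q.get(w, 0.0) - d.get(w, 0.0)) ** p for w in words)

def lp_distance_sparse(q, d, p):
    """Equation (14.35): only words non-zero in both vectors are visited."""
    shared = set(q) & set(d)
    return 2.0 + sum(abs(q[w] - d[w]) ** p - abs(q[w]) ** p - abs(d[w]) ** p
                     for w in shared)
```

In a real system the shared words would be found through the inverted files rather than by set intersection, but the arithmetic is the same.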


In order to mitigate quantization errors due to noise in the descriptor vectors, Nister and 
Stewenius (2006) not only score leaf nodes in the vocabulary tree (corresponding to visual 
words), but also score interior nodes in the tree, which correspond to clusters of similar visual 
words. 

18 In their actual implementation, Nister and Stewenius (2006) use an $L_1$ metric.

Figure 14.30 Scalable recognition using a vocabulary tree (Nister and Stewenius 2006) ©
2006 IEEE. (a) Each MSER elliptical region is converted into a SIFT descriptor, which is
then quantized by comparing it hierarchically to some prototype descriptors in a vocabulary
tree. Each leaf node stores its own inverted index (sparse list of non-zero tf-idf counts) into
images that contain that feature. (b) A recognition result, showing a query image (top row)
being indexed into a database of 6000 test images and correctly finding the corresponding
four images.


Because of the high efficiency in both quantizing and scoring features, their vocabulary- 
tree-based recognition system is able to process incoming images in real time against a 
database of 40,000 CD covers and at 1 Hz when matching a database of one million frames
taken from six feature-length movies. Figure 14.30b shows some typical images from the 
database of objects taken under varying viewpoints and illumination that was used to train 
and test the vocabulary tree recognition system. 

Figure 14.31 Location or building recognition using randomized trees (Philbin, Chum, Isard
et al. 2007) © 2007 IEEE. The left image is the query, the other images are the highest-ranked
results.

The state of the art in instance recognition continues to improve rapidly. Philbin, Chum,
Isard et al. (2007) have shown that a randomized forest of k-d trees performs better than
vocabulary trees on a large location recognition task (Figure 14.31). They also compare the
effects of using different 2D motion models (Section 2.1.2) in the verification stage. In follow-on
work, Chum, Philbin, Sivic et al. (2007) apply another idea from information retrieval, namely
query expansion, which involves re-submitting top-ranked images from the initial query as
additional queries to generate additional candidate results, to further improve recognition
rates for difficult (occluded or oblique) examples. Philbin, Chum, Sivic et al. (2008) show
how to mitigate quantization problems in visual words selection using soft assignment, where
each feature descriptor is mapped to a number of visual words based on its distance from the
cluster prototypes. The soft weights derived from these distances are used, in turn, to weight
the counts used in the tf-idf vectors and to retrieve additional images for later verification.
Taken together, these recent advances hold the promise of extending current instance recognition
algorithms to performing Web-scale retrieval and matching tasks (Agarwal, Snavely,
Simon et al. 2009; Agarwal, Furukawa, Snavely et al. 2010; Snavely, Simon, Goesele et al.
2010).
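
Soft assignment can be sketched as below. The Gaussian weighting $\exp(-d^2 / 2\sigma^2)$ and the parameters `r` (number of words kept) and `sigma` are assumptions chosen for illustration; the text only states that the weights are derived from the distances to the cluster prototypes.

```python
import math

def soft_assign(descriptor, centers, r=3, sigma=1.0):
    """Weights over the r nearest visual words, proportional to exp(-d^2 / (2 sigma^2))."""
    d2 = [(sum((x - c) ** 2 for x, c in zip(descriptor, center)), k)
          for k, center in enumerate(centers)]
    nearest = sorted(d2)[:r]
    raw = [(k, math.exp(-dist / (2.0 * sigma ** 2))) for dist, k in nearest]
    total = sum(w for _, w in raw)
    # Normalize so the weights of one descriptor sum to one.
    return {k: w / total for k, w in raw}
```

The resulting fractional weights replace the integer counts $n_{id}$ when the tf-idf vector is built. For very distant descriptors all raw weights can underflow to zero, so a production version would need to guard the normalization.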

14.3.3 Application: Location recognition

One of the most exciting applications of instance recognition today is in the area of location 
recognition, which can be used both in desktop applications (where did I take this holiday 
snap?) and in mobile (cell-phone) applications. The latter case includes not only finding out 
your current location based on a cell-phone image but also providing you with navigation 
directions or annotating your images with useful information, such as building names and 
restaurant reviews (i.e., a portable form of augmented reality). 

Some approaches to location recognition assume that the photos consist of architectural
scenes for which vanishing directions can be used to pre-rectify the images for easier matching
(Robertson and Cipolla 2004). Other approaches use general affine covariant interest
points to perform wide baseline matching (Schaffalitzky and Zisserman 2002). The Photo
Tourism system of Snavely, Seitz, and Szeliski (2006) (Section 13.1.2) was the first to apply
these kinds of ideas to large-scale image matching and (implicit) location recognition from
Internet photo collections taken under a wide variety of viewing conditions.

Figure 14.32 Feature-based location recognition (Schindler, Brown, and Szeliski 2007) ©
2007 IEEE: (a) three typical series of overlapping street photos; (b) handheld camera shots
and (c) their corresponding database photos.

The main difficulty in location recognition is in dealing with the extremely large commu- 
nity (user-generated) photo collections on Web sites such as Flickr (Philbin, Chum, Isard et 
al. 2007; Chum, Philbin, Sivic et al. 2007; Philbin, Chum, Sivic et al. 2008; Turcot and Lowe 
2009) or commercially captured databases (Schindler, Brown, and Szeliski 2007). The preva- 
lence of commonly appearing elements such as foliage, signs, and common architectural ele- 
ments further complicates the task. Figure 14.31 shows some results on location recognition 
from community photo collections, while Figure 14.32 shows sample results from denser 
commercially acquired datasets. In the latter case, the overlap between adjacent database 
images can be used to verify and prune potential matches using “temporal” filtering, i.e., re- 
quiring the query image to match nearby overlapping database images before accepting the 
match. 

Another variant on location recognition is the automatic discovery of landmarks, i.e.,
frequently photographed objects and locations. Simon, Snavely, and Seitz (2007) show how
these kinds of objects can be discovered simply by analyzing the matching graph constructed
as part of the 3D modeling process in Photo Tourism. More recent work has extended this
approach to larger data sets using efficient clustering techniques (Philbin and Zisserman 2008;
Li, Wu, Zach et al. 2008; Chum, Philbin, and Zisserman 2008; Chum and Matas 2010) as well
as combining meta-data such as GPS and textual tags with visual search (Quack, Leibe, and
Van Gool 2008; Crandall, Backstrom, Huttenlocher et al. 2009), as shown in Figure 14.33.
It is now even possible to automatically associate object tags with images based on their
co-occurrence in multiple loosely tagged images (Simon and Seitz 2008; Gammeter, Bossard,
Quack et al. 2009).

Figure 14.33 Automatic mining, annotation, and localization of community photo collections
(Quack, Leibe, and Van Gool 2008) © 2008 ACM. This figure does not show the textual
annotations or corresponding Wikipedia entries, which are also discovered.

Figure 14.34 Locating star fields using astrometry, http://astrometry.net/. (a) Input star field
and some selected star quads. (b) The 2D coordinates of stars C and D are encoded relative
to the unit square defined by A and B.

The concept of organizing the world’s photo collections by location has even been re- 
cently extended to organizing all of the universe’s (astronomical) photos in an application 
called astrometry, http://astrometry.net/. The technique used to match any two star fields is
to take quadruplets of nearby stars (a pair of stars and another pair inside their diameter) to 
form a 30-bit geometric hash by encoding the relative positions of the second pair of points 
using the inscribed square as the reference frame, as shown in Figure 14.34. Traditional in- 
formation retrieval techniques (k-d trees built for different parts of a sky atlas) are then used 
to find matching quads as potential star field location hypotheses, which can then be verified 
using a similarity transform. 
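
The geometric hash for a star quad can be sketched with complex arithmetic. The convention used here, mapping star A to (0, 0) and star B to (1, 1), is an assumption about the reference frame, and the quantization of the four coordinates into a 30-bit code is not shown.

```python
def quad_code(a, b, c, d):
    """Positions of C and D in the frame where A maps to (0, 0) and B to (1, 1)."""
    za, zb, zc, zd = (complex(*p) for p in (a, b, c, d))
    # A single complex factor encodes the rotation and scale of the similarity.
    to_frame = (1 + 1j) / (zb - za)
    wc, wd = (zc - za) * to_frame, (zd - za) * to_frame
    return (wc.real, wc.imag, wd.real, wd.imag)
```

Because the code depends only on positions relative to A and B, it is unchanged when the whole star field is translated, rotated, or uniformly scaled, which is what allows it to be used as a lookup key.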
Figure 14.35 Sample images from the Xerox 10 class dataset (Csurka, Dance, Perronnin et
al. 2006) © 2007 Springer. Imagine trying to write a program to distinguish such images 
from other photographs. 

14.4 Category recognition 

While instance recognition techniques are relatively mature and are used in commercial ap- 
plications, such as Photosynth (Section 13.1.2), generic category (class) recognition is still 
a largely unsolved problem. Consider for example the set of photographs in Figure 14.35, 
which shows objects taken from 10 different visual categories. (I’ll leave it up to you to name 
each of the categories.) How would you go about writing a program to categorize each of 
these images into the appropriate class, especially if you were also given the choice “none of 
the above”? 

As you can tell from this example, visual category recognition is an extremely challenging 
problem; no one has yet constructed a system that approaches the performance level of a two- 
year-old child. However, the progress in the field has been quite dramatic, if judged by how 
much better today’s algorithms are compared to those of a decade ago. 

Figure 14.54 shows a sample image from each of the 20 categories used in the 2008 
PASCAL Visual Object Classes Challenge. The yellow boxes represent the extent of each of 
the objects found in a given image. On such closed world collections where the task is to 
decide among 20 categories, today’s classification algorithms can do remarkably well.

Figure 14.36 A typical processing pipeline for a bag-of-words category recognition system
(Csurka, Dance, Perronnin et al. 2006) © 2007 Springer. Features are first extracted at 
keypoints and then quantized to get a distribution (histogram) over the learned visual words 
(feature cluster centers). The feature distribution histogram is used to learn a decision surface 
using a classification algorithm, such as a support vector machine. 


In this section, we look at a number of approaches to solving category recognition. While 
historically, part-based representations and recognition algorithms (Section 14.4.2) were the 
preferred approach (Fischler and Elschlager 1973; Felzenszwalb and Huttenlocher 2005; 
Fergus, Perona, and Zisserman 2007), we begin by describing simpler bag-of-features ap- 
proaches (Section 14.4.1) that represent objects and images as unordered collections of fea- 
ture descriptors. We then look at the problem of simultaneously segmenting images while 
recognizing objects (Section 14.4.3) and also present some applications of such techniques to 
photo manipulation (Section 14.4.4). In Section 14.5, we look at how context and scene un- 
derstanding, as well as machine learning, can improve overall recognition results. Additional 
details on the techniques presented in this section can be found in (Pinz 2005; Ponce, Hebert, 
Schmid et al. 2006; Dickinson, Leonardis, Schiele et al. 2007; Fei-Fei, Fergus, and Torralba
2009). 

14.4.1 Bag of words 

One of the simplest algorithms for category recognition is the bag of words (also known as 
bag of features or bag of keypoints) approach (Csurka, Dance, Fan et al. 2004; Lazebnik, 
Schmid, and Ponce 2006; Csurka, Dance, Perronnin et al. 2006; Zhang, Marszalek, Lazeb- 
nik et al. 2007). As shown in Figure 14.36, this algorithm simply computes the distribu- 
tion (histogram) of visual words found in the query image and compares this distribution 
to those found in the training images. We have already seen elements of this approach in 
Section 14.3.2, Equations (14.33-14.35) and Algorithm 14.2. The biggest difference from 
instance recognition is the absence of a geometric verification stage (Section 14.3.1), since 
individual instances of generic visual categories, such as those shown in Figure 14.35, have 
relatively little spatial coherence to their features (but see the work by Lazebnik, Schmid, and
Ponce (2006)).

Csurka, Dance, Fan et al. (2004) were the first to use the term bag of keypoints to describe 
such approaches and among the first to demonstrate the utility of frequency-based techniques 
for category recognition. Their original system used affine covariant regions and SIFT de- 
scriptors, k-means visual vocabulary construction, and both a naive Bayesian classifier and 
support vector machines for classification. (The latter was found to perform better.) Their 
newer system (Csurka, Dance, Perronnin et al. 2006) uses regular (non-affine) SIFT patches, 
boosting instead of SVMs, and incorporates a small amount of geometric consistency infor- 
mation. 

Zhang, Marszalek, Lazebnik et al. (2007) perform a more detailed study of such bag of 
features systems. They compare a number of feature detectors (Harris-Laplace (Mikolajczyk 
and Schmid 2004) and Laplacian (Lindeberg 1998b)), descriptors (SIFT, RIFT, and SPIN 
(Lazebnik, Schmid, and Ponce 2005)), and SVM kernel functions. To estimate distances for 
the kernel function, they form an image signature 

$$S = ((t_1, \mathbf{m}_1), \ldots, (t_m, \mathbf{m}_m)), \tag{14.36}$$

analogous to the tf-idf vector $\mathbf{t}$ in (14.34), where the cluster centers $\mathbf{m}_i$ are made explicit.
They then investigate two different kernels for comparing such image signatures. The first is 
the earth mover’s distance (EMD) (Rubner, Tomasi, and Guibas 2000), 


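$$\mathrm{EMD}(S, S') = \frac{\sum_i \sum_j f_{ij}\, d(\mathbf{m}_i, \mathbf{m}'_j)}{\sum_i \sum_j f_{ij}}, \tag{14.37}$$

where $f_{ij}$ is a flow value that can be computed using a linear program and $d(\mathbf{m}_i, \mathbf{m}'_j)$ is the
ground distance (Euclidean distance) between $\mathbf{m}_i$ and $\mathbf{m}'_j$. Note that the EMD can be used
to compare two signatures of different lengths, where the entries do not need to correspond.
The second is a $\chi^2$ distance

$$\chi^2(S, S') = \frac{1}{2} \sum_i \frac{(t_i - t'_i)^2}{t_i + t'_i}, \tag{14.38}$$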


which measures the likelihood that the two signatures were generated from consistent random 
processes. These distance metrics are then converted into SVM kernels using a generalized 
Gaussian kernel 

$$K(S, S') = \exp\left(-\frac{1}{A} D(S, S')\right), \tag{14.39}$$

where $A$ is a scaling parameter set to the mean distance between training images. In their
experiments, they find that the EMD works best for visual category recognition and the $\chi^2$
measure is best for texture recognition.
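
The $\chi^2$ distance (14.38) and the generalized Gaussian kernel (14.39) can be sketched as follows. Two details are assumptions: bins where both histograms are empty are skipped to avoid dividing by zero, and $A$ is taken as the mean over off-diagonal training pairs. The EMD is omitted because it requires a linear program solver.

```python
import math

def chi2_distance(t, t_prime):
    """Equation (14.38) for two histograms of equal length."""
    return 0.5 * sum((a - b) ** 2 / (a + b)
                     for a, b in zip(t, t_prime) if a + b > 0)

def generalized_gaussian_kernel(signatures, distance):
    """Kernel matrix K = exp(-D / A), with A the mean pairwise training distance."""
    n = len(signatures)
    D = [[distance(signatures[i], signatures[j]) for j in range(n)]
         for i in range(n)]
    off_diag = [D[i][j] for i in range(n) for j in range(n) if i != j]
    A = sum(off_diag) / len(off_diag)
    return [[math.exp(-D[i][j] / A) for j in range(n)] for i in range(n)]
```

The resulting matrix can be passed to any SVM implementation that accepts a precomputed kernel. The sketch assumes at least two distinct training signatures, since otherwise $A$ would be zero.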





Figure 14.37 Comparing collections of feature vectors using pyramid matching. (a) The
feature-space pyramid match kernel (Grauman and Darrell 2007b) constructs a pyramid in
high-dimensional feature space and uses it to compute distances (and implicit correspondences)
between sets of feature vectors. (b) Spatial pyramid matching (Lazebnik, Schmid,
and Ponce 2006) © 2006 IEEE divides the image into a pyramid of pooling regions and
computes separate visual word histograms (distributions) inside each spatial bin.


Instead of quantizing feature vectors to visual words, Grauman and Darrell (2007b) de- 
velop a technique for directly computing an approximate distance between two variably sized 
collections of feature vectors. Their approach is to bin the feature vectors into a multi- 
resolution pyramid defined in feature space (Figure 14.37a) and count the number of features 
that land in corresponding bins $B_{il}$ and $B'_{il}$ (Figure 14.38a-c). The distance between the two
sets of feature vectors (which can be thought of as points in a high-dimensional space) is
computed using histogram intersection between corresponding bins

$$C_l = \sum_i \min(B_{il}, B'_{il}) \tag{14.40}$$

(Figure 14.38d). These per-level counts are then summed up in a weighted fashion

$$D_\Delta = \sum_l w_l N_l \quad \text{with} \quad N_l = C_l - C_{l-1} \quad \text{and} \quad w_l = \frac{1}{d\,2^l} \tag{14.41}$$


(Figure 14.38e), which discounts matches already found at finer levels while weighting finer 
matches more heavily. ($d$ is the dimension of the embedding space, i.e., the length of the
feature vectors.) In follow-on work, Grauman and Darrell (2007a) show how an explicit 
construction of the pyramid can be avoided using hashing techniques. 
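
For one-dimensional features, the pyramid match computation in (14.40) and (14.41) can be sketched as below. The choice of bin width $2^l$ at level $l$ (so that level 0 is the finest) and taking $C_{-1} = 0$ are assumptions consistent with the weights $w_l = 1/(d\,2^l)$. Larger scores indicate more similar sets.

```python
from collections import Counter

def pyramid_match(xs, ys, levels=4, d=1):
    """Weighted count of new matches per level for two sets of 1D features.

    Level l uses bins of width 2**l, so level 0 is the finest.
    """
    score, prev = 0.0, 0
    for l in range(levels):
        width = 2 ** l
        bx = Counter(int(x // width) for x in xs)
        by = Counter(int(y // width) for y in ys)
        c_l = sum(min(n, by[b]) for b, n in bx.items())   # (14.40)
        n_l = c_l - prev                                  # new matches at this level
        score += n_l / (d * 2 ** l)                       # (14.41)
        prev = c_l
    return score
```

In the usage below, the sets `[0.5, 3.2]` and `[0.7, 2.1]` share one bin at level 0 and gain one further match at level 1, giving $1 + 1/2 = 1.5$.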

Figure 14.38 A one-dimensional illustration of comparing collections of feature vectors
using the pyramid match kernel (Grauman and Darrell 2007b): (a) distribution of feature
vectors (point sets) into the pyramidal bins; (b-c) histogram of point counts in bins $B_{il}$ and
$B'_{il}$ for the two images; (d) histogram intersections (minimum values); (e) per-level similarity
scores, which are weighted and summed to form the final distance/similarity metric.

Inspired by this work, Lazebnik, Schmid, and Ponce (2006) show how a similar idea
can be employed to augment bags of keypoints with loose notions of 2D spatial location
analogous to the pooling performed by SIFT (Lowe 2004) and “gist” (Torralba, Murphy,
Freeman et al. 2003). In their work, they extract affine region descriptors (Lazebnik, Schmid,
and Ponce 2005) and quantize them into visual words. (Based on previous results by Fei-Fei
and Perona (2005), the feature descriptors are extracted densely (on a regular grid) over the
image, which can be helpful in describing textureless regions such as the sky.) They then form
a spatial pyramid of bins containing word counts (histograms), as shown in Figure 14.37b, and
use a similar pyramid match kernel to combine histogram intersection counts in a hierarchical
fashion.
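
The spatial pyramid of word counts can be sketched as a concatenation of per-cell histograms. The sketch assumes feature positions normalized to the unit square and a grid of $2^l \times 2^l$ cells at level $l$; the per-level weighting applied by the pyramid match kernel is left out.

```python
def spatial_pyramid_histogram(points, words, vocab_size, levels=3):
    """Concatenate visual-word histograms over 1x1, 2x2, 4x4, ... grids.

    points: (x, y) in [0, 1) x [0, 1); words: visual-word index per point.
    """
    feature = []
    for l in range(levels):
        cells = 2 ** l
        hist = [0] * (cells * cells * vocab_size)
        for (x, y), w in zip(points, words):
            cx = min(int(x * cells), cells - 1)
            cy = min(int(y * cells), cells - 1)
            hist[(cy * cells + cx) * vocab_size + w] += 1
        feature.extend(hist)
    return feature
```

Level 0 reproduces the ordinary bag-of-words histogram, while finer levels record roughly where in the image each word occurred.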

The debate about whether to use quantized feature descriptors or continuous descriptors 
and also whether to use sparse or dense features continues to this day. Boiman, Shechtman, 
and Irani (2008) show that if query images are compared to all the features representing a 
given class, rather than just each class image individually, nearest-neighbor matching fol- 
lowed by a naive Bayes classifier outperforms quantized visual words (Figure 14.39). In- 
stead of using generic feature detectors and descriptors, some authors have been investigat- 
ing learning class-specific features (Ferencz, Learned-Miller, and Malik 2008), often using 
randomized forests (Philbin, Chum, Isard et al. 2007; Moosmann, Nowak, and Jurie 2008; 
Shotton, Johnson, and Cipolla 2008) or combining the feature generation and image classi- 
Figure 14.39 “Image-to-Image” vs. “Image-to-Class” distance comparison (Boiman, 
Shechtman, and Irani 2008) © 2008 IEEE. The query image on the upper left may not match 
the feature distribution of any of the database images in the bottom row. However, if each 
feature in the query is matched to its closest analog in all the class images, a good match can 
be found. 


fication stages (Yang, Jin, Sukthankar et al. 2008). Others, such as Serre, Wolf, and Poggio 
(2005) and Mutch and Lowe (2008) use hierarchies of dense feature transforms inspired by 
biological (visual cortical) processing combined with SVMs for final classification. 
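
The image-to-class comparison of Boiman, Shechtman, and Irani (2008) discussed above can be sketched roughly as follows. This is a simplified illustration that assumes descriptors from all training images of a class have been pooled into one array, and it uses brute-force nearest-neighbor search with squared Euclidean distance. The function name `nbnn_classify` and the toy data are hypothetical.

```python
import numpy as np

def nbnn_classify(query, class_descriptors):
    """Pick the class whose pooled descriptors best explain the query features."""
    query = np.asarray(query, dtype=float)              # (n, d) query descriptors
    totals = {}
    for label, pool in class_descriptors.items():
        pool = np.asarray(pool, dtype=float)            # (m, d) pooled class descriptors
        # Squared distance from every query descriptor to every class descriptor.
        d2 = ((query[:, None, :] - pool[None, :, :]) ** 2).sum(axis=2)
        # Each query descriptor is matched to its closest analog in the class.
        totals[label] = float(d2.min(axis=1).sum())
    return min(totals, key=totals.get), totals

classes = {"a": [[0.0, 0.0], [1.0, 1.0]], "b": [[5.0, 5.0], [6.0, 6.0]]}
label, totals = nbnn_classify([[0.1, 0.0], [0.9, 1.0]], classes)
print(label, totals)
```

Because each query feature is free to find its nearest neighbor in any image of the class, no single training image needs to resemble the query as a whole.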


14.4.2 Part-based models 

Recognizing an object by finding its constituent parts and measuring their geometric rela- 
tionships is one of the oldest approaches to object recognition (Fischler and Elschlager 1973; 
Kanade 1977; Yuille 1991). We have already seen examples of part-based approaches being 
used for face recognition (Figure 14.18) (Moghaddam and Pentland 1997; Heisele, Ho, Wu 
et al. 2003; Heisele, Serre, and Poggio 2007) and pedestrian detection (Figure 14.9) (Felzen- 
szwalb, McAllester, and Ramanan 2008). 

In this section, we look more closely at some of the central issues in part-based recog- 
nition, namely, the representation of geometric relationships, the representation of individ- 
ual parts, and algorithms for learning such descriptions and recognizing them at run time. 
More details on part-based models for recognition can be found in the course notes of Fergus 
(2007b, 2009). 

The earliest approaches to representing geometric relationships were dubbed pictorial 
structures by Fischler and Elschlager (1973) and consisted of spring-like connections between 
different feature locations (Figure 14.1a). To fit a pictorial structure to an image, an energy 

Figure 14.40 Using pictorial structures to locate and track a person (Felzenszwalb and 
Huttenlocher 2005) © 2005 Springer. The structure consists of articulated rectangular body parts 
(torso, head, and limbs) connected in a tree topology that encodes relative part positions and 
orientations. To fit a pictorial structure model, a binary silhouette image is first computed 
using background subtraction. 


function of the form 

$$\mathcal{E} = \sum_i V_i(l_i) + \sum_{ij \in E} V_{ij}(l_i, l_j) \qquad (14.42)$$

is minimized over all potential part locations or poses $\{l_i\}$ and pairs of parts $(i,j)$ for which 
an edge (geometric relationship) exists in E. Note how this energy is closely related to 
that used with Markov random fields (3.108-3.109), which can be used to embed pictorial 
structures in a probabilistic framework that makes parameter learning easier (Felzenszwalb 
and Huttenlocher 2005). 

Part-based models can have different topologies for the geometric connections between 
the parts (Figure 14.41). For example, Felzenszwalb and Huttenlocher (2005) restrict the 
connections to a tree (Figure 14.41d), which makes learning and inference more tractable. A 
tree topology enables the use of a recursive Viterbi (dynamic programming) algorithm (Pearl 
1988; Bishop 2006), in which leaf nodes are first optimized as a function of their parents, and 
the resulting values are then plugged in and eliminated from the energy function (see 
Appendix B.5.2). The Viterbi algorithm computes an optimal match in $O(N^2 |E| + NP)$ time, 
where $N$ is the number of potential locations or poses for each part, $|E|$ is the number of 
edges (pairwise constraints), and $P = |V|$ is the number of parts (vertices in the graphical 
model, which is equal to $|E| + 1$ in a tree). To further increase the efficiency of the 
inference algorithm, Felzenszwalb and Huttenlocher (2005) restrict the pairwise energy functions 
$V_{ij}(l_i, l_j)$ to be Mahalanobis distances on functions of location variables and then use fast 
distance transform algorithms to minimize each pairwise interaction in time that is closer to 
linear in $N$. 
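
The leaf-to-root elimination described above can be sketched as a min-sum dynamic program. The code below is a minimal illustration, assuming that each part has the same small set of candidate locations, that unary and pairwise costs are given as dense arrays, and that the edge list is ordered from the root outwards. It does not use the distance transform speed-up, so each edge costs O(N^2). The function name `fit_tree` and its argument layout are hypothetical.

```python
import numpy as np

def fit_tree(unary, edges, pairwise, root=0):
    """Minimize unary plus pairwise costs over a tree by dynamic programming.

    unary:    dict part -> (N,) array of costs V_i(l_i)
    edges:    list of (parent, child) pairs, ordered from the root outwards
    pairwise: dict (parent, child) -> (N, N) array of costs V_ij(l_i, l_j)
    """
    cost = {p: np.array(u, dtype=float) for p, u in unary.items()}
    best_child = {}
    # Leaves first: eliminate each child as a function of its parent's location.
    for parent, child in reversed(edges):
        total = pairwise[(parent, child)] + cost[child][None, :]
        best_child[(parent, child)] = total.argmin(axis=1)
        cost[parent] += total.min(axis=1)
    # Root outwards: read back the optimal location of every part.
    labels = {root: int(cost[root].argmin())}
    for parent, child in edges:
        labels[child] = int(best_child[(parent, child)][labels[parent]])
    return float(cost[root].min()), labels

rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (0, 3)]
unary = {i: rng.random(4) for i in range(4)}
pairwise = {e: rng.random((4, 4)) for e in edges}
print(fit_tree(unary, edges, pairwise))
```

Each edge is visited once in each direction, which is where the linear dependence on the number of edges comes from.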

Figure 14.40 shows the results of using their pictorial structures algorithm to fit an articu- 

Figure 14.41 Graphical models for geometric spatial priors (Carneiro and Lowe 2006) © 
2006 Springer: (a) constellation (Fergus, Perona, and Zisserman 2007); (b) star (Crandall, 
Felzenszwalb, and Huttenlocher 2005; Fergus, Perona, and Zisserman 2005); (c) $k$-fan ($k = 
2$) (Crandall, Felzenszwalb, and Huttenlocher 2005); (d) tree (Felzenszwalb and Huttenlocher 
2005); (e) bag of features (Csurka, Dance, Fan et al. 2004); (f) hierarchy (Bouchard and 
Triggs 2005); (g) sparse flexible model (Carneiro and Lowe 2006). 


lated body model to a binary image obtained by background segmentation. In this application 
of pictorial structures, parts are parameterized by the locations, sizes, and orientations of their 
approximating rectangles. Unary matching potentials $V_i(l_i)$ are determined by counting the 
percentage of foreground and background pixels inside and just outside the tilted rectangle 
representing each part. 

Over the last decade, a large number of different graphical models have been proposed 
for part-based recognition, as shown in Figure 14.41. Carneiro and Lowe (2006) discuss 
a number of these models and propose one of their own, which they call a sparse flexible 
model; it involves ordering the parts and having each part’s location depend on at most $k$ of 
its ancestor locations. 

The simplest models, which we saw in Section 14.4.1, are bags of words, where there are 
no geometric relationships between different parts or features. While such models can be very 
efficient, they have a very limited capacity to express the spatial arrangement of parts. Trees 
and stars (a special case of trees where all leaf nodes are directly connected to a common root) 
are the most efficient in terms of inference and hence also learning (Felzenszwalb and 
Huttenlocher 2005; Fergus, Perona, and Zisserman 2005; Felzenszwalb, McAllester, and Ramanan 
2008). Directed acyclic graphs (Figure 14.41f-g) come next in terms of complexity and can 
still support efficient inference, although at the cost of imposing a causal structure on the 
part model (Bouchard and Triggs 2005; Carneiro and Lowe 2006). $k$-fans, in which a clique 
of size $k$ forms the root of a star-shaped model (Figure 14.41c), have inference complexity 
$O(N^{k+1})$, although with distance transforms and Gaussian priors, this can be lowered to 
$O(N^k)$ (Crandall, Felzenszwalb, and Huttenlocher 2005; Crandall and Huttenlocher 2006). 
Finally, fully connected constellation models (Figure 14.41a) are the most general, but the 
assignment of features to parts becomes intractable for moderate numbers of parts $P$, since 
the complexity of such an assignment is $O(N^P)$ (Fergus, Perona, and Zisserman 2007). 
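
To get a feel for how quickly these costs diverge, consider a hypothetical model with $N = 100$ candidate locations per part and $P = 6$ parts (so $|E| = 5$ in a tree). These numbers are purely illustrative and constant factors are ignored:

```latex
\begin{align*}
\text{tree or star:} \quad & N^2 |E| + NP = 100^2 \cdot 5 + 100 \cdot 6 \approx 5 \times 10^{4} \\
\text{$k$-fan with } k = 2: \quad & N^{k+1} = 100^{3} = 10^{6} \\
\text{constellation:} \quad & N^{P} = 100^{6} = 10^{12}
\end{align*}
```

Even at this modest size, the fully connected model is about seven orders of magnitude more expensive than the tree.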

The original constellation model was developed by Burl, Weber, and Perona (1998) and 
consists of a number of parts whose relative positions are encoded by their mean locations 
and a full covariance matrix, which is used to denote not only positional uncertainty but also 
potential correlations (covariance) between different parts (Figure 14.42a). Weber, Welling, 
and Perona (2000) extended this technique to a weakly supervised setting, where both the 
appearance of each part and its locations are automatically learned given only whole image 
labels. Fergus, Perona, and Zisserman (2007) further extend this approach to simultaneous 
learning of appearance and shape models from scale-invariant keypoint detections. 
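
The shape term of such a model can be illustrated with a joint Gaussian over the stacked part coordinates. The sketch below only evaluates the log-likelihood of one hypothesized configuration. It assumes the mean and full covariance are already known and ignores appearance, scale, and occlusion terms. The function name `shape_log_likelihood` is hypothetical.

```python
import numpy as np

def shape_log_likelihood(part_xy, mean, cov):
    """Log-density of stacked part locations under a joint Gaussian shape model."""
    x = np.asarray(part_xy, dtype=float).ravel() - np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    _, logdet = np.linalg.slogdet(cov)
    # The full covariance couples the coordinates of different parts.
    maha = x @ np.linalg.solve(cov, x)
    return float(-0.5 * (maha + logdet + x.size * np.log(2.0 * np.pi)))

mean = np.zeros(4)                      # two parts, (x, y) each
cov = np.eye(4)
print(shape_log_likelihood([[0.0, 0.0], [0.0, 0.0]], mean, cov))
print(shape_log_likelihood([[1.0, 0.0], [0.0, 0.0]], mean, cov))
```

Off-diagonal blocks of the covariance are what let the model express that, for example, two parts tend to move together.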

Figure 14.42a shows the shape model learned for the motorcycle class. The top figure 
shows the mean relative locations for each part along with their position covariances (inter- 
part covariances are not shown) and likelihood of occurrence. The bottom curve shows the 
Gaussian PDFs for the relative log-scale of each part with respect to the “landmark” feature. 
Figure 14.42b shows the appearance model learned for each part, visualized as the patches 
around detected features in the training database that best match the appearance model. Fig- 
ure 14.42c shows the features detected in the test database (pink dots) along with the corre- 
sponding parts that they were assigned to (colored circles). As you can see, the system has 
successfully learned and then used a fairly complex model of motorcycle appearance. 

The part-based approach to recognition has also been extended to learning new categories 
from small numbers of examples, building on recognition components developed for other 
classes (Fei-Fei, Fergus, and Perona 2006). More complex hierarchical part-based models can 
be developed using the concept of grammars (Bouchard and Triggs 2005; Zhu and Mumford 
2006). A simpler way to use parts is to have keypoints that are recognized as being part of 
a class vote for the estimated part locations, as shown in the top row of Figure 14.43 (Leibe, 
Leonardis, and Schiele 2008). (Implicitly, this corresponds to having a star-shaped geometric 
model.) 


14.4.3 Recognition with segmentation 

The most challenging version of generic object recognition is to simultaneously perform 
recognition with accurate boundary segmentation (Fergus 2007a). For instance recognition 
(Section 14.3.1), this can sometimes be achieved by backprojecting the object model into 

Figure 14.42 Part-based recognition (Fergus, Perona, and Zisserman 2007) © 2007 
Springer: (a) locations and covariance ellipses for each part, along with their occurrence 
probabilities (top) and relative log-scale densities (bottom); (b) part examples drawn from 
the training images that best match the average appearance; (c) recognition results for the 
motorcycle class, showing detected features (pink dots) and parts (colored circles). 


Figure 14.43 Interleaved recognition and segmentation (Leibe, Leonardis, and Schiele 
2008) © 2008 Springer. The process starts by re-recognizing visual words (codebook en- 
tries) in a new image (scene) and having each part vote for likely locations and size in a 
3D (x, y, s) voting space (top row). Once a maximum has been found, the parts (features) 
corresponding to this instance are determined by backprojecting the contributing votes. The 
foreground-background segmentation for each object can be found by backprojecting proba- 
bilistic masks associated with each codebook entry. The whole recognition and segmentation 
process can then be repeated. 


the scene (Lowe 2004), as shown in Figure 14.1d, or matching portions of the new scene to 
pre-learned (segmented) object models (Ferrari, Tuytelaars, and Van Gool 2006b; Kannala, 
Rahtu, Brandt et al. 2008). 

For more complex (flexible) object models, such as those for humans (Figure 14.1f), a 
different approach is to pre-segment the image into larger or smaller pieces (Chapter 5) and 
then match such pieces to portions of the model (Mori, Ren, Efros et al. 2004; Mori 2005; 
He, Zemel, and Ray 2006; Gu, Lim, Arbelaez et al. 2009). 

An alternative approach by Leibe, Leonardis, and Schiele (2008), which we introduced 
in the previous section, votes for potential object locations and scales based on the detec- 
tion of features corresponding to pre-clustered visual codebook entries (Figure 14.43). To 
support segmentation, each codebook entry has an associated foreground-background mask, 
which is learned as part of the codebook clustering process from pre-labeled object segmen- 
tation masks. During recognition, once a maximum in the voting space is found, the masks 
associated with the entries that voted for this instance are combined to obtain an object seg- 
mentation, as shown on the left side of Figure 14.43. 
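
The voting and backprojection steps can be sketched as follows. This is a simplified illustration that votes only for the object center in 2D (the scale dimension and the probabilistic masks are omitted) and assumes integer pixel coordinates and a codebook that stores, for each visual word, the center offsets seen in training. The function name `vote_for_center` and the data layout are hypothetical.

```python
import numpy as np

def vote_for_center(matches, offsets, height, width):
    """Accumulate center votes from matched codebook entries and find the maximum.

    matches: list of (x, y, word) for features matched to codebook entries
    offsets: dict word -> list of (dx, dy) displacements to the object center
    """
    votes = np.zeros((height, width))
    voters = {}
    for idx, (x, y, word) in enumerate(matches):
        entries = offsets.get(word, [])
        for dx, dy in entries:
            cx, cy = x + dx, y + dy
            if 0 <= cx < width and 0 <= cy < height:
                # Each feature spreads one unit of evidence over its hypotheses.
                votes[cy, cx] += 1.0 / len(entries)
                voters.setdefault((cx, cy), []).append(idx)
    cy, cx = np.unravel_index(votes.argmax(), votes.shape)
    center = (int(cx), int(cy))
    # Backprojection: the features whose votes contributed to the maximum.
    return center, voters.get(center, [])

offsets = {0: [(2, 0)], 1: [(-2, 0)], 2: [(0, 3), (0, -3)]}
matches = [(3, 5, 0), (7, 5, 1), (5, 2, 2), (9, 9, 0)]
print(vote_for_center(matches, offsets, height=12, width=12))
```

In a fuller version, the masks stored with the contributing codebook entries would be pasted at the feature locations and combined to obtain the segmentation.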

A more holistic approach to recognition and segmentation is to formulate the problem as 
one of labeling every pixel in an image with its class membership, and to solve this prob- 

Figure 14.44 Simultaneous recognition and segmentation using TextonBoost (Shotton, 
Winn, Rother et al. 2009) © 2009 Springer: (a) successful recognition results; (b) less 
successful results. 


Figure 14.45 Layout consistent random field (Winn and Shotton 2006) © 2006 IEEE. The 
numbers indicate the kind of neighborhood relations that can exist between pixels assigned 
to the same or different classes. Each pairwise relationship carries its own likelihood (energy 
penalty). 


lem using energy minimization or Bayesian inference techniques, i.e., conditional random 
fields (Section 3.7.2, (3.118)) (Kumar and Hebert 2006; He, Zemel, and Carreira-Perpinan 
2004). The TextonBoost system of Shotton, Winn, Rother et al. (2009) uses unary (pixel- 
wise) potentials based on image-specific color distributions (Section 5.5) (Boykov and Jolly 
2001; Rother, Kolmogorov, and Blake 2004), location information (e.g., foreground objects 
are more likely to be in the middle of the image, sky is likely to be higher, and road is likely 
to be lower), and novel texture-layout classifiers trained using shared boosting. It also uses 
traditional pairwise potentials that look at image color gradients (Veksler 2001; Boykov and 
Jolly 2001; Rother, Kolmogorov, and Blake 2004). The texton-layout features first filter the 
image with a series of 17 oriented filter banks and then cluster the responses to classify each 
pixel into 30 different texton classes (Malik, Belongie, Leung et al. 2001). The responses 
are then filtered using offset rectangular regions trained with joint boosting (Viola and Jones 
2004) to produce the texton-layout features used as unary potentials. 
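
To make the structure of such a labeling energy concrete, the sketch below evaluates a generic pixel-labeling energy with per-pixel unary costs and a contrast-sensitive pairwise term on a 4-connected grid. It is not the TextonBoost potential set: the unary costs are simply taken as given, the image is grayscale, and the parameters `lam` and `beta` and the function name `crf_energy` are illustrative assumptions.

```python
import numpy as np

def crf_energy(labels, unary, image, lam=1.0, beta=1.0):
    """Unary plus contrast-sensitive pairwise energy of a pixel labeling."""
    labels = np.asarray(labels)                 # (H, W) integer class labels
    unary = np.asarray(unary, dtype=float)      # (H, W, K) per-pixel class costs
    image = np.asarray(image, dtype=float)      # (H, W) grayscale intensities
    h, w = labels.shape
    rows, cols = np.indices((h, w))
    energy = unary[rows, cols, labels].sum()
    neighbor_pairs = (
        (labels[:, :-1], labels[:, 1:], image[:, :-1], image[:, 1:]),   # horizontal
        (labels[:-1, :], labels[1:, :], image[:-1, :], image[1:, :]),   # vertical
    )
    for la, lb, ia, ib in neighbor_pairs:
        # Label changes are cheap across strong edges and costly in flat regions.
        energy += lam * (np.exp(-beta * (ia - ib) ** 2) * (la != lb)).sum()
    return float(energy)

unary = np.zeros((2, 2, 2))
unary[..., 1] = 1.0
image = [[0.0, 0.0], [1.0, 1.0]]
print(crf_energy([[0, 0], [1, 1]], unary, image))
```

Inference then amounts to searching for the labeling that minimizes this energy, for example with graph cuts.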

Figure 14.44a shows some examples of images successfully labeled and segmented using 
TextonBoost, while Figure 14.44b shows examples where it does not do as well. As you can 
see, this kind of semantic labeling can be extremely challenging. 

The TextonBoost conditional random field framework has been extended to LayoutCRFs 
by Winn and Shotton (2006), who incorporate additional constraints to recognize multiple 
object instances and deal with occlusions (Figure 14.45), and even more recently by Hoiem, 
Rother, and Winn (2007) to incorporate full 3D models. 

Conditional random fields continue to be widely used and extended for simultaneous 
recognition and segmentation applications (Kumar and Hebert 2006; He, Zemel, and Ray 
2006; Levin and Weiss 2006; Verbeek and Triggs 2007; Yang, Meer, and Foran 2007; 
Rabinovich, Vedaldi, Galleguillos et al. 2007; Batra, Sukthankar, and Chen 2008; Larlus and Jurie 


Figure 14.46 Scene completion using millions of photographs (Hays and Efros 2007) © 
2007 ACM: (a) original image; (b) after unwanted foreground removal; (c) plausible scene 
matches, with the one the user selected highlighted in red; (d) output image after replacement 
and blending. 


2008; He and Zemel 2008; Kumar, Torr, and Zisserman 2010), producing some of the best 
results on the difficult PASCAL VOC segmentation challenge (Shotton, Johnson, and Cipolla 
2008; Kohli, Ladicky, and Torr 2009). Approaches that first segment the image into unique 
or multiple segmentations (Borenstein and Ullman 2008; He, Zemel, and Ray 2006; Russell, 
Efros, Sivic et al. 2006) (potentially combined with CRF models) also do quite well: Csurka 
and Perronnin (2008) have one of the top algorithms in the VOC segmentation challenge. 
Hierarchical (multi-scale) and grammar (parsing) models are also sometimes used (Tu, Chen, 
Yuille et al. 2005; Zhu, Chen, Lin et al. 2008). 


14.4.4 Application: Intelligent photo editing 

Recent advances in object recognition and scene understanding have greatly increased the 
power of intelligent (semi-automated) photo editing applications. One example is the Photo 
Clip Art system of Lalonde, Hoiem, Efros et al. (2007), which recognizes and segments 
objects of interest, such as pedestrians, in Internet photo collections and then allows users to 
paste them into their own photos. Another is the scene completion system of Hays and Efros 
(2007), which tackles the same inpainting problem we studied in Section 10.5. Given an 
image in which we wish to erase and fill in a large section (Figure 14.46a-b), where do we 
get the pixels to fill in the gaps in the edited image? Traditional approaches either use smooth 
continuation (Bertalmio, Sapiro, Caselles et al. 2000) or borrow pixels from other parts of 
the image (Efros and Leung 1999; Criminisi, Perez, and Toyama 2004; Efros and Freeman 
2001). With the advent of huge repositories of images on the Web (a topic we return to in 
Section 14.5.1), it often makes more sense to find a different image to serve as the source of 
the missing pixels. 

Figure 14.47 Automatic photo pop-up (Hoiem, Efros, and Hebert 2005a) © 2005 ACM: 
(a) input image; (b) superpixels are grouped into (c) multiple regions; (d) labelings indicating 
ground (green), vertical (red), and sky (blue); (e) novel view of resulting piecewise-planar 3D 
model. 


In their system, Hays and Efros (2007) compute the gist of each image (Oliva and Torralba 
2001; Torralba, Murphy, Freeman et al. 2003) to find images with similar colors and 
composition. They then run a graph cut algorithm that minimizes image gradient differences 
and composite the new replacement piece into the original image using Poisson image blend- 
ing (Section 9.3.4) (Perez, Gangnet, and Blake 2003). Figure 14.46d shows the resulting 
image with the erased foreground rooftops region replaced with sailboats. 

A different application of image recognition and segmentation is to infer 3D structure 
from a single photo by recognizing certain scene structures. For example, Criminisi, Reid, 
and Zisserman (2000) detect vanishing points and have the user draw basic structures, such 
as walls, in order to infer the 3D geometry (Section 6.3.3). Hoiem, Efros, and Hebert (2005a), 
on the other hand, work with more “organic” scenes such as the one shown in Figure 14.47. 
Their system uses a variety of classifiers and statistics learned from labeled images to classify 
each pixel as either ground, vertical, or sky (Figure 14.47d). To do this, they begin by com- 
puting superpixels (Figure 14.47b) and then group them into plausible regions that are likely 
to share similar geometric labels (Figure 14.47c). After all the pixels have been labeled, the 
boundaries between the vertical and ground pixels can be used to infer 3D lines along which 
the image can be folded into a “pop-up” (after removing the sky pixels), as shown in Fig- 
ure 14.47e. In related work, Saxena, Sun, and Ng (2009) develop a system that directly infers 
the depth and orientation of each pixel instead of using just three geometric class labels. 

Face detection and localization can also be used in a variety of photo editing applications 
(in addition to being used in-camera to provide better focus, exposure, and flash settings). 
Zanella and Fuentes (2004) use active shape models (Section 14.2.2) to register facial features 
for creating automated morphs. Rother, Bordeaux, Hamadi et al. (2006) use face and sky 
detection to determine regions of interest in order to decide which pieces from a collection 
of images to stitch into a collage. Bitouk, Kumar, Dhillon et al. (2008) describe a system 
that matches a given face image to a large collection of Internet face images, which can 
then be used (with careful relighting algorithms) to replace the face in the original image. 
Applications they describe include de-identification and getting the best possible smile from 

Figure 14.48 The importance of context (images courtesy of Antonio Torralba). Can you 
name all of the objects in images (a-b), especially those that are circled in (c-d)? Look 
carefully at the circled objects. Did you notice that they all have the same shape (after being 
rotated), as shown in column (e)? 


everyone in a “burst mode” group shot. Leyvand, Cohen-Or, Dror et al. (2008) show how 
accurately locating facial features using an active shape model (Cootes, Edwards, and Taylor 
2001; Zhou, Gu, and Zhang 2003) can be used to warp such features (and hence the image) 
towards configurations resembling those found in images whose facial attractiveness was 
highly rated, thereby “beautifying” the image without completely losing a person’s identity. 

Most of these techniques rely either on a set of labeled training images, which is an 
essential component of all learning techniques, or the even more recent explosion in images 
available on the Internet. The assumption in some of this work (and in recognition systems 
based on such very large databases (Section 14.5.1)) is that as the collection of accessible (and 
potentially partially labeled) images gets larger, finding a close match gets easier. As Hays 
and Efros (2007) state in their abstract, “Our chief insight is that while the space of images is 
effectively infinite, the space of semantically differentiable scenes is actually not that large.” 
In an interesting commentary on their paper, Levoy (2008) disputes this assertion, claiming 
that “features in natural scenes form a heavy-tailed distribution, meaning that while some 
features in photographs are more common than others, the relative occurrence of less common 
features drops slowly. In other words, there are many unusual photographs in the world.” He 
does, however, agree that in computational photography, as in many other applications such 
as speech recognition, synthesis, and translation, “simple machine learning algorithms often 
outperform more sophisticated ones if trained on large enough databases.” He also goes on 
to point out both the potential advantages of such systems, such as better automatic color 
balancing, and potential issues and pitfalls with the kind of image fakery that these new 
approaches enable. 

For additional examples of photo editing and computational photography applications 

Figure 14.49 More examples of context: read the letters in the first group, the numbers in 
the second, and the letters and numbers in the third. (Images courtesy of Antonio Torralba.) 

enabled by Internet computer vision, please see recent workshops on this topic, 19 as well as 
the special journal issue (Avidan, Baker, and Shan 2010), and the course on Internet Vision 
by Tamara Berg (2008). 

14.5 Context and scene understanding 

Thus far, we have mostly considered the task of recognizing and localizing objects in isolation 
from that of understanding the scene (context) in which the objects occur. This is a severe 
limitation, as context plays a very important role in human object recognition (Oliva and 
Torralba 2007). As we will see in this section, it can greatly improve the performance of 
object recognition algorithms (Divvala, Hoiem, Hays et al. 2009), as well as providing useful 
semantic clues for general scene understanding (Torralba 2008). 

Consider the two photographs in Figure 14.48a-b. Can you name all of the objects, 
especially those circled in images (c-d)? Now have a closer look at the circled objects. 
Do you see any similarity in their shapes? In fact, if you rotate them by 90°, they are all the 
same as the “blob” shown in Figure 14.48e. So much for our ability to recognize objects by 
their shape! Another (perhaps more artificial) example of recognition in context is shown in 
Figure 14.49. Try to name all of the letters and numbers, and then see if you guessed right. 

Even though we have not addressed context explicitly earlier in this chapter, we have 
already seen several instances of this general idea being used. A simple way to incorporate 
spatial information into a recognition algorithm is to compute feature statistics over different 
regions, as in the spatial pyramid system of Lazebnik, Schmid, and Ponce (2006). Part-based 
models (Section 14.4.2, Figures 14.40-14.43) use a kind of local context, where various parts 
need to be arranged in a proper geometric relationship to constitute an object. 

The biggest difference between part-based and context models is that the latter combine 
objects into scenes and the number of constituent objects from each class is not known in 
advance. In fact, it is possible to combine part-based and context models into the same 

19 http://www.internetvisioner.org/. 

Figure 14.50 Contextual scene models for object recognition (Sudderth, Torralba, Freeman 
et al. 2008) © 2008 Springer: (a) some street scenes and their corresponding labels (magenta 
= buildings, red = cars, green = trees, blue = road); (b) some office scenes (red = computer 
screen, green = keyboard, blue = mouse); (c) learned contextual models built from these 
labeled scenes. The top row shows a sample label image and the distribution of the objects 
relative to the center red (car or screen) object. The bottom rows show the distributions of 
parts that make up each object. 


recognition architecture (Murphy, Torralba, and Freeman 2003; Sudderth, Torralba, Freeman 
et al. 2008; Crandall and Huttenlocher 2007). 

Consider the street and office scenes shown in Figure 14.50a-b. If we have enough train- 
ing images with labeled regions, such as buildings, cars, and roads or monitors, keyboards, 
and mice, we can develop a geometric model for describing their relative positions. Sud- 
derth, Torralba, Freeman et al. (2008) develop such a model, which can be thought of as a 
two-level constellation model. At the top level, the distribution of objects relative to each 
other (say, buildings with respect to cars) is modeled as a Gaussian (Figure 14.50c, upper 
right corners). At the bottom level, the distribution of parts (affine covariant features) with 
respect to the object center is modeled using a mixture of Gaussians (Figure 14.50c, lower 
two rows). However, since the number of objects in the scene and parts in each object is 
unknown, a latent Dirichlet process (LDP) is used to model object and part creation in a gen- 
erative framework. The distributions for all of the objects and parts are learned from a large 
labeled database and then later used during inference (recognition) to label the elements of a 
scene. 

Another example of context is in simultaneous segmentation and recognition (Section 14.4.3) 
(Figures 14.44-14.45), where the arrangements of various objects in a scene are used as part 
of the labeling process. Torralba, Murphy, and Freeman (2004) describe a conditional random 
field where the estimated locations of buildings and roads influence the detection of cars, and 
where boosting is used to learn the structure of the CRF. Rabinovich, Vedaldi, Galleguillos 
et al. (2007) use context to improve the results of CRF segmentation by noting that certain 
adjacencies (relationships) are more likely than others, e.g., a person is more likely to be on 
a horse than on a dog. 

Context also plays an important role in 3D inference from single images (Figure 14.47), 
using computer vision techniques for labeling pixels as belonging to the ground, vertical 
surfaces, or sky (Hoiem, Efros, and Hebert 2005a,b). This line of work has been extended to 
a more holistic approach that simultaneously reasons about object identity, location, surface 
orientations, occlusions, and camera viewing parameters (Hoiem, Efros, and Hebert 2008a,b). 

A number of approaches use the gist of a scene (Torralba 2003; Torralba, Murphy, 
Freeman et al. 2003) to determine where instances of particular objects are likely to occur. For 
example, Murphy, Torralba, and Freeman (2003) train a regressor to predict the vertical 
locations of objects such as pedestrians, cars, and buildings (or screens and keyboards for indoor 
office scenes) based on the gist of an image. These location distributions are then used with 
classic object detectors to improve the performance of the detectors. Gists can also be used to 
directly match complete images, as we saw in the scene completion work of Hays and Efros 
(2007). 
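
One simple way to picture how a predicted location distribution could be combined with a detector is to reweight each detection score by a prior on its vertical position. The sketch below is only an assumption about how such a combination might look: it uses a Gaussian prior with a predicted mean and a fixed spread, and multiplies it with the raw scores. The function name `rescore_detections` and the parameter `sigma` are hypothetical, and the cited systems may combine these cues differently.

```python
import numpy as np

def rescore_detections(scores, ys, predicted_y, sigma):
    """Reweight detector scores by a Gaussian prior on vertical image position."""
    scores = np.asarray(scores, dtype=float)    # raw detector scores
    ys = np.asarray(ys, dtype=float)            # vertical position of each detection
    # Prior is 1 at the predicted row and decays with vertical distance from it.
    prior = np.exp(-0.5 * ((ys - predicted_y) / sigma) ** 2)
    return scores * prior

# A strong detection far from the expected row loses to a weaker one near it.
print(rescore_detections([0.9, 0.8], ys=[10, 100], predicted_y=100, sigma=20))
```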

Finally, some of the most recent work in scene understanding exploits the existence of 
large numbers of labeled (or even unlabeled) images to perform matching directly against 
whole images, where the images themselves implicitly encode the expected relationships 
between objects (Figure 14.51) (Russell, Torralba, Liu et al. 2007; Malisiewicz and Efros 
2008). We discuss such techniques in the next section, where we look at the influence that 
large image databases have had on object recognition and scene understanding. 


14.5.1 Learning and large image collections 

Given how learning techniques are widely used in recognition algorithms, you may wonder 
whether the topic of learning deserves its own section (or even chapter), or whether it is just 
part of the basic fabric of all recognition tasks. In fact, trying to build a recognition system 
without lots of training data for anything other than a basic pattern such as a UPC code has 
proven to be a dismal failure. 

In this chapter, we have already seen lots of techniques borrowed from the machine learn- 
ing, statistics, and pattern recognition communities. These include principal component, sub- 
space, and discriminant analysis (Section 14.2.1) and more sophisticated discriminative clas- 
sification algorithms such as neural networks, support vector machines, and boosting (Sec- 

Figure 14.51 Recognition by scene alignment (Russell, Torralba, Liu et al. 2007): (a) input 
image; (b) matched images with similar scene configurations; (c) final labeling of the input 
image. 


tion 14.1.1). Some of the best-performing techniques on challenging recognition benchmarks 
(Varma and Ray 2007; Felzenszwalb, McAllester, and Ramanan 2008; Fritz and Schiele 2008; 
Vedaldi, Gulshan, Varma et al. 2009) rely heavily on the latest machine learning techniques, 
whose development is often being driven by challenging vision problems (Freeman, Perona, 
and Scholkopf 2008). 

A distinction sometimes made in the recognition community is between problems where 
most of the variables of interest (say, parts) are already (partially) labeled and systems that 
learn more of the problem structure with less supervision (Fergus, Perona, and Zisserman 
2007; Fei-Fei, Fergus, and Perona 2006). In fact, recent work by Sivic, Russell, Zisserman et 
al. (2008) has demonstrated the ability to learn visual hierarchies (hierarchies of object parts 
with related visual appearance) and scene segmentations in a totally unsupervised framework. 

Perhaps the most dramatic change in the recognition community has been the appearance 
of very large databases of training images. 20 Early learning-based algorithms, such as those 
for face and pedestrian detection (Section 14.1), used relatively few (in the hundreds) labeled 
examples to train recognition algorithm parameters (say, the thresholds used in boosting). To- 
day, some recognition algorithms use databases such as LabelMe (Russell, Torralba, Murphy 
et al. 2008), which contain tens of thousands of labeled examples. 

The existence of such large databases opens up the possibility of matching directly against 
the training images rather than using them to learn the parameters of recognition algorithms. 
Russell, Torralba, Liu et al. (2007) describe a system where a new image is matched against 
each of the training images, from which a consensus labeling for the unknown objects in 
the scene can be inferred, as shown in Figure 14.51. Malisiewicz and Efros (2008) start 
by over-segmenting each image and then use the LabelMe database to search for similar 
images and configurations in order to obtain per-pixel category labelings. It is also possible 
to combine feature-based correspondence algorithms with large labeled databases to perform

20 We have already seen some computational photography applications of such databases in Section 14.4.4. 
Figure 14.52 Recognition using tiny images (Torralba, Freeman, and Fergus 2008) © 2008
IEEE: columns (a) and (c) show sample input images and columns (b) and (d) show the 
corresponding 16 nearest neighbors in the database of 80 million tiny images. 


simultaneous recognition and segmentation (Liu, Yuen, and Torralba 2009). 

When the database of images becomes large enough, it is even possible to directly match 
complete images with the expectation of finding a good match. Torralba, Freeman, and Fergus 
(2008) start with a database of 80 million tiny (32 × 32) images and compensate for the poor
accuracy in their image labels, which are collected automatically from the Internet, by using
a semantic taxonomy (WordNet) to infer the most likely labels for a new image. Somewhere
in the 80 million images, there are enough examples to associate some set of images with
each of the 75,000 non-abstract nouns in WordNet that they use in their system. Some sample
recognition results are shown in Figure 14.52. 
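
The idea of letting noisy neighbor labels vote through a semantic taxonomy can be sketched as follows. This is only a toy illustration under stated assumptions, not the authors' implementation: the miniature `PARENT` taxonomy stands in for WordNet, the 4-pixel vectors stand in for 32 × 32 images, and the function names and the SSD distance are choices made here for brevity.

```python
from collections import Counter

# Hypothetical miniature taxonomy: child -> parent (a stand-in for WordNet).
PARENT = {"beagle": "dog", "poodle": "dog", "dog": "animal",
          "tabby": "cat", "cat": "animal", "animal": None}

def ancestors(label):
    """Yield the label followed by all of its ancestors up to the root."""
    while label is not None:
        yield label
        label = PARENT[label]

def ssd(a, b):
    """Sum of squared differences between two equal-length pixel vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def vote_labels(query, database, k=3):
    """Each of the k nearest neighbors votes for its label and every ancestor."""
    nearest = sorted(database, key=lambda item: ssd(query, item[0]))[:k]
    votes = Counter()
    for _, label in nearest:
        votes.update(ancestors(label))
    return votes

# Tiny 4-pixel "images" with noisy labels (hypothetical data).
database = [([10, 10, 200, 200], "beagle"),
            ([12, 9, 190, 205], "poodle"),
            ([11, 14, 198, 190], "tabby"),
            ([250, 240, 5, 8], "tabby")]
votes = vote_labels([11, 10, 197, 201], database)
print(votes.most_common())
```

In this toy run no fine-grained label receives more than one vote, but the neighbors agree at coarser levels of the hierarchy, which is the sense in which a taxonomy can compensate for unreliable labels.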

Another example of a large labeled database of images is ImageNet (Deng, Dong, Socher 
et al. 2009), which is collecting images for the 80,000 nouns (synonym sets) in WordNet
(Fellbaum 1998). As of April 2010, about 500-1000 carefully vetted examples for 14,841
Figure 14.53 ImageNet (Deng, Dong, Socher et al. 2009) © 2009 IEEE. This database
contains over 500 carefully vetted images for each of 14,841 (as of April, 2010) nouns from 
the WordNet hierarchy. 


synsets have been collected (Figure 14.53). The paper by Deng, Dong, Socher et al. (2009) 
also has a nice review of related databases. 

As we mentioned in Section 14.4.3, the existence of large databases of partially labeled 
Internet imagery has given rise to a new sub-field of Internet computer vision, with its own 
workshops 21 and a special journal issue (Avidan, Baker, and Shan 2010). 

14.5.2 Application: Image search

Even though visual recognition algorithms are by some measures still in their infancy, they 
are already starting to have some impact on image search, i.e., the retrieval of images from the 
Web using combinations of keywords and visual similarity. Today, most image search engines 
rely primarily on textual keywords found in captions, nearby text, and filenames, augmented by
user click-through data (Craswell and Szummer 2007). As recognition algorithms continue 
to improve, however, visual features and visual similarity will start being used to recognize 
images with missing or erroneous keywords. 

The topic of searching by visual similarity has a long history and goes by a variety of 
names, including content-based image retrieval (CBIR) (Smeulders, Worring, Santini et al. 
2000; Lew, Sebe, Djeraba et al. 2006; Vasconcelos 2007; Datta, Joshi, Li et al. 2008) and 
query by image content (QBIC) (Flickner, Sawhney, Niblack et al. 1995). Original publica- 
tions in these fields were based primarily on simple whole-image similarity metrics, such as 
color and texture (Swain and Ballard 1991; Jacobs, Finkelstein, and Salesin 1995; Manjunath
and Ma 1996). 
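
As a concrete illustration of a whole-image similarity metric, the following minimal sketch computes a histogram intersection score, in the spirit of Swain and Ballard (1991), between two coarsely quantized color histograms. The function names, the choice of four quantization levels per channel, and the toy pixel lists are assumptions made for illustration and are not taken from the cited papers.

```python
from collections import Counter

def color_histogram(pixels, levels=4):
    """Normalized histogram over coarsely quantized (r, g, b) values in 0..255."""
    step = 256 // levels
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = sum(counts.values())
    return {bin_id: n / total for bin_id, n in counts.items()}

def histogram_intersection(h1, h2):
    """Sum of bin-wise minima of two normalized histograms (1.0 means identical)."""
    return sum(min(v, h2.get(bin_id, 0.0)) for bin_id, v in h1.items())

# Toy "images" given as flat lists of RGB pixels (hypothetical data).
sky = [(40, 90, 220)] * 6 + [(250, 250, 250)] * 2
sea = [(30, 100, 200)] * 5 + [(20, 60, 90)] * 3
grass = [(30, 180, 40)] * 8

sky_vs_sea = histogram_intersection(color_histogram(sky), color_histogram(sea))
sky_vs_grass = histogram_intersection(color_histogram(sky), color_histogram(grass))
print(sky_vs_sea, sky_vs_grass)
```

Because such a measure ignores where colors occur in the image, two very different scenes with similar color distributions receive a high score, which is one reason later systems moved to feature-based matching.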

21 
In more recent work, Fergus, Perona, and Zisserman (2004) use a feature-based learning
and recognition algorithm to re-rank the outputs from a traditional keyword-based image 
search engine. In follow-on work, Fergus, Fei-Fei, Perona et al. (2005) cluster the results 
returned by image search using an extension of probabilistic latent semantic analysis (PLSA)
(Hofmann 1999) and then select the clusters associated with the highest ranked results as the 
representative images for that category. 

Even more recent work relies on carefully annotated image databases such as LabelMe 
(Russell, Torralba, Murphy et al. 2008). For example, Malisiewicz and Efros (2008) describe 
a system that, given a query image, can find similar LabelMe images, whereas Liu, Yuen, and 
Torralba (2009) combine feature-based correspondence algorithms with the labeled database 
to perform simultaneous recognition and segmentation. 


14.6 Recognition databases and test sets 

In addition to rapid advances in machine learning and statistical modeling techniques, one 
of the key ingredients in the continued improvement of recognition algorithms has been the 
increased availability and quality of image recognition databases. 

Tables 14.1 and 14.2, which are based on similar tables in Fei-Fei, Fergus, and Torralba 
(2009), updated with more recent entries and URLs, show some of the most widely used
recognition databases. Some of these databases, such as the ones for face recognition and 
localization, date back over a decade. The most recent ones, such as the PASCAL database, 
are refreshed annually with ever more challenging problems. Table 14.1 shows examples of 
databases used primarily for (whole image) recognition while Table 14.2 shows databases 
where more accurate localization or segmentation information is available and expected. 

Ponce, Berg, Everingham et al. (2006) discuss some of the problems with earlier datasets 
and describe how the latest PASCAL Visual Object Classes Challenge aims to overcome 
these. Some examples of the 20 visual classes in the 2008 challenge are shown in Figure 14.54.
The slides from the VOC workshops 22 are a great source for pointers to the
best recognition techniques currently available. 

Two of the most recent trends in recognition databases are the emergence of Web-based 
annotation and data collection tools, and the use of search and recognition algorithms to build 
up databases (Ponce, Berg, Everingham et al. 2006). Some of the most interesting work in 
human annotation of images comes from a series of interactive multi-person games such as 
ESP (von Ahn and Dabbish 2004) and Peekaboom (von Ahn, Liu, and Blum 2006). In these 
games, people help each other guess the identity of a hidden image by giving textual clues 
as to its contents, which implicitly labels either the whole image or just regions. A more 

22 http://pascallin.ecs.soton.ac.uk/challenges/VOC/.
Name                              Extents                   Contents
    URL
    Reference

Face and person recognition

Yale face database                Centered face images      Frontal faces
    http://www1.cs.columbia.edu/~belhumeur/
    Belhumeur, Hespanha, and Kriegman (1997)
Resources for face detection      Various databases         Faces in various poses
    http://vision.ai.uiuc.edu/mhyang/face-detection-survey.html
    Yang, Kriegman, and Ahuja (2002)
FERET                             Centered face images      Frontal faces
    http://www.frvt.org/FERET
    Phillips, Moon, Rizvi et al. (2000)
FRVT                              Centered face images      Faces in various poses
    http://www.frvt.org/
    Phillips, Scruggs, O’Toole et al. (2010)
CMU PIE database                  Centered face image       Faces in various poses
    http://www.ri.cmu.edu/projects/project_418.html
    Sim, Baker, and Bsat (2003)
CMU Multi-PIE database            Centered face image       Faces in various poses
    http://multipie.org
    Gross, Matthews, Cohn et al. (2010)
Faces in the Wild                 Internet images           Faces in various poses
    http://vis-www.cs.umass.edu/lfw/
    Huang, Ramesh, Berg et al. (2007)
Consumer image person DB          Complete images           People
    http://chenlab.ece.cornell.edu/people/Andy/GallagherDataset.html
    Gallagher and Chen (2008)

Object recognition

Caltech 101                       Segmentation masks        101 categories
    http://www.vision.caltech.edu/Image_Datasets/Caltech101/
    Fei-Fei, Fergus, and Perona (2006)
Caltech 256                       Centered objects          256 categories and clutter
    http://www.vision.caltech.edu/Image_Datasets/Caltech256/
    Griffin, Holub, and Perona (2007)
COIL-100                          Centered objects          100 instances
    http://www1.cs.columbia.edu/CAVE/software/softlib/coil-100.php
    Nene, Nayar, and Murase (1996)
ETH-80                            Centered objects          8 instances, 10 views
    http://www.mis.tu-darmstadt.de/datasets
    Leibe and Schiele (2003)
Instance recognition benchmark    Objects in various poses  2550 objects
    http://vis.uky.edu/~stewe/ukbench/
    Nister and Stewenius (2006)
Oxford buildings dataset          Pictures of buildings     5062 images
    http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/
    Philbin, Chum, Isard et al. (2007)
NORB                              Bounding box              50 toys
    http://www.cs.nyu.edu/~ylclab/data/norb-v1.0/
    LeCun, Huang, and Bottou (2004)
Tiny images                       Complete images           75,000 (WordNet) things
    http://people.csail.mit.edu/torralba/tinyimages/
    Torralba, Freeman, and Fergus (2008)
ImageNet                          Complete images           14,000 (WordNet) things
    http://www.image-net.org/
    Deng, Dong, Socher et al. (2009)

Table 14.1 Image databases for recognition, adapted and expanded from Fei-Fei, Fergus,
and Torralba (2009).
Name                              Extents                   Contents
    URL
    Reference

Object detection / localization

CMU frontal faces                 Patches                   Frontal faces
    http://vasc.ri.cmu.edu/idb/html/face/frontal_images
    Rowley, Baluja, and Kanade (1998a)
MIT frontal faces                 Patches                   Frontal faces
    http://cbcl.mit.edu/software-datasets/FaceData2.html
    Sung and Poggio (1998)
CMU face detection databases      Multiple faces            Faces in various poses
    http://www.ri.cmu.edu/research_project_detail.html?project_id=419
    Schneiderman and Kanade (2004)
UIUC Image DB                     Bounding boxes            Cars
    http://l2r.cs.uiuc.edu/~cogcomp/Data/Car/
    Agarwal and Roth (2002)
Caltech Pedestrian Dataset        Bounding boxes            Pedestrians
    http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/
    Dollar, Wojek, Schiele et al. (2009)
Graz-02 Database                  Segmentation masks        Bikes, cars, people
    http://www.emt.tugraz.at/~pinz/data/GRAZ_02/
    Opelt, Pinz, Fussenegger et al. (2006)
ETHZ Toys                         Cluttered images          Toys, boxes, magazines
    http://www.vision.ee.ethz.ch/~calvin/datasets.html
    Ferrari, Tuytelaars, and Van Gool (2006b)
TU Darmstadt DB                   Segmentation masks        Motorbikes, cars, cows
    http://www.vision.ee.ethz.ch/~bleibe/data/datasets.html
    Leibe, Leonardis, and Schiele (2008)
MSR Cambridge                     Segmentation masks        23 classes
    http://research.microsoft.com/en-us/projects/objectclassrecognition/
    Shotton, Winn, Rother et al. (2009)
LabelMe dataset                   Polygonal boundary        >500 categories
    http://labelme.csail.mit.edu/
    Russell, Torralba, Murphy et al. (2008)
Lotus Hill                        Segmentation masks        Scenes and hierarchies
    http://www.imageparsing.com/
    Yao, Yang, Lin et al. (2010)

On-line annotation tools

ESP game                          Image descriptions        Web images
    http://www.gwap.com/gwap/
    von Ahn and Dabbish (2004)
Peekaboom                         Labeled regions           Web images
    http://www.gwap.com/gwap/
    von Ahn, Liu, and Blum (2006)
LabelMe                           Polygonal boundary        High-resolution images
    http://labelme.csail.mit.edu/
    Russell, Torralba, Murphy et al. (2008)

Collections of challenges

PASCAL                            Segmentation, boxes       Various
    http://pascallin.ecs.soton.ac.uk/challenges/VOC/
    Everingham, Van Gool, Williams et al. (2010)

Table 14.2 Image databases for detection and localization, adapted and expanded from Fei-
Fei, Fergus, and Torralba (2009).
Figure 14.54 Sample images from the PASCAL Visual Object Classes Challenge 2008
(VOC2008) database (Everingham, Van Gool, Williams et al. 2008). The original images 
were obtained from flickr (http://www.flickr.com/) and the database rights are explained on 
http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2008/. 


“serious” volunteer effort is the LabelMe database, in which vision researchers contribute 
manual polygonal region annotations in return for gaining access to the database (Russell, 
Torralba, Murphy et al. 2008). 

The use of computer vision algorithms for collecting recognition databases dates back to 
the work of Fergus, Fei-Fei, Perona et al. (2005), who cluster the results returned by Google 
image search using an extension of PLSA and then select the clusters associated with the 
highest ranked results. More recent examples of related techniques include the work of Berg 
and Forsyth (2006) and Li and Fei-Fei (2010). 

Whatever methods are used to collect and validate recognition databases, they will con- 
tinue to grow in size, utility, and difficulty from year to year. They will also continue to be 
an essential component of research into the recognition and scene understanding problems, 
which remain, as always, the grand challenges of computer vision. 
14.7 Additional reading

Although there are currently no specialized textbooks on image recognition and scene un- 
derstanding, some surveys (Pinz 2005) and collections of papers (Ponce, Hebert, Schmid et 
al. 2006; Dickinson, Leonardis, Schiele et al. 2007) can be found that describe the latest ap- 
proaches. Other good sources of recent research are courses on this topic, such as the ICCV 
2009 short course (Fei-Fei, Fergus, and Torralba 2009) and Antonio Torralba’s more com- 
prehensive MIT course (Torralba 2008). The PASCAL VOC Challenge Web site contains 
workshop slides that summarize today’s best performing algorithms. 

The literature on face, pedestrian, car, and other object detection is quite extensive. Sem- 
inal papers in face detection include those by Osuna, Freund, and Girosi (1997), Sung and 
Poggio (1998), Rowley, Baluja, and Kanade (1998a), and Viola and Jones (2004), with Yang, 
Kriegman, and Ahuja (2002) providing a comprehensive survey of early work in this field. 
More recent examples include (Heisele, Ho, Wu et al. 2003; Heisele, Serre, and Poggio 2007). 

Early work in pedestrian and car detection was carried out by Gavrila and Philomin 
(1999), Gavrila (1999), Papageorgiou and Poggio (2000), Mohan, Papageorgiou, and Pog- 
gio (2001), and Schneiderman and Kanade (2004). More recent examples include the work 
of Belongie, Malik, and Puzicha (2002), Mikolajczyk, Schmid, and Zisserman (2004), Dalal
and Triggs (2005), Leibe, Seemann, and Schiele (2005), Dalal, Triggs, and Schmid (2006),
Opelt, Pinz, and Zisserman (2006), Torralba (2007), Andriluka, Roth, and Schiele (2008),
Felzenszwalb, McAllester, and Ramanan (2008), Rogez, Rihan, Ramalingam et al. (2008),
Andriluka, Roth, and Schiele (2009), Kumar, Zisserman, and Torr (2009), Dollar, Belongie,
and Perona (2010), and Felzenszwalb, Girshick, McAllester et al. (2010).

While some of the earliest approaches to face recognition involved finding the distinc- 
tive image features and measuring the distances between them (Fischler and Elschlager 1973; 
Kanade 1977; Yuille 1991), more recent approaches rely on comparing gray-level images, 
often projected onto lower dimensional subspaces (Turk and Pentland 1991a; Belhumeur, 
Hespanha, and Kriegman 1997; Moghaddam and Pentland 1997; Moghaddam, Jebara, and 
Pentland 2000; Heisele, Ho, Wu et al. 2003; Heisele, Serre, and Poggio 2007). Additional 
details on principal component analysis (PCA) and its Bayesian counterparts can be found in 
Appendix B.1.1 and books and articles on this topic (Hastie, Tibshirani, and Friedman 2001; 
Bishop 2006; Roweis 1998; Tipping and Bishop 1999; Leonardis and Bischof 2000; Vidal, 
Ma, and Sastry 2010). The topics of subspace learning, local distance functions, and metric 
learning are covered by Cai, He, Hu et al. (2007), Frome, Singer, Sha et al. (2007), Guillaumin,
Verbeek, and Schmid (2009), Ramanan and Baker (2009), and Sivic, Everingham,
and Zisserman (2009). An alternative to directly matching gray-level images or patches is to 
use non-linear local transforms such as local binary patterns (Ahonen, Hadid, and Pietikainen 
2006; Zhao and Pietikainen 2007; Cao, Yin, Tang et al. 2010). 
In order to boost the performance of what are essentially 2D appearance-based models,
a variety of shape and pose deformation models have been developed (Beymer 1996; Vet- 
ter and Poggio 1997), including Active Shape Models (Lanitis, Taylor, and Cootes 1997; 
Cootes, Cooper, Taylor et al. 1995; Davies, Twining, and Taylor 2008), Elastic Bunch Graph 
Matching (Wiskott, Fellous, Kruger et al. 1997), 3D Morphable Models (Blanz and Vetter 
1999), and Active Appearance Models (Costen, Cootes, Edwards et al. 1999; Cootes, Edwards,
and Taylor 2001; Gross, Baker, Matthews et al. 2005; Gross, Matthews, and Baker
2006; Matthews, Xiao, and Baker 2007; Liang, Xiao, Wen et al. 2008; Ramnath, Koterba, 
Xiao et al. 2008). The topic of head pose estimation, in particular, is covered in a recent 
survey by Murphy-Chutorian and Trivedi (2009). 

Additional information about face recognition can be found in a number of surveys and 
books on this topic (Chellappa, Wilson, and Sirohey 1995; Zhao, Chellappa, Phillips et al. 
2003; Li and Jain 2005) as well as on the Face Recognition Web site. 23 Databases for face
recognition are discussed by Phillips, Moon, Rizvi et al. (2000), Sim, Baker, and Bsat (2003),
Gross, Shi, and Cohn (2005), Huang, Ramesh, Berg et al. (2007), and Phillips, Scruggs,
O’Toole et al. (2010).

Algorithms for instance recognition, i.e., the detection of static man-made objects that 
only vary slightly in appearance but may vary in 3D pose, are mostly based on detecting 
2D points of interest and describing them using viewpoint-invariant descriptors (Lowe 2004; 
Rothganger, Lazebnik, Schmid et al. 2006; Ferrari, Tuytelaars, and Van Gool 2006b; Gordon 
and Lowe 2006; Obdrzalek and Matas 2006; Kannala, Rahtu, Brandt et al. 2008; Sivic and 
Zisserman 2009). 

As the size of the database being matched increases, it becomes more efficient to quantize 
the visual descriptors into words (Sivic and Zisserman 2003; Schindler, Brown, and Szeliski 
2007; Sivic and Zisserman 2009; Turcot and Lowe 2009), and to then use information- 
retrieval techniques, such as inverted indices (Nister and Stewenius 2006; Philbin, Chum, 
Isard et al. 2007; Philbin, Chum, Sivic et al. 2008), query expansion (Chum, Philbin, Sivic 
et al. 2007; Agarwal, Snavely, Simon et al. 2009), and min hashing (Philbin and Zisserman 
2008; Li, Wu, Zach et al. 2008; Chum, Philbin, and Zisserman 2008; Chum and Matas 2010) 
to perform efficient retrieval and clustering. 

A number of surveys, collections of papers, and course notes have been written on the 
topic of category recognition (Pinz 2005; Ponce, Hebert, Schmid et al. 2006; Dickinson, 
Leonardis, Schiele et al. 2007; Fei-Fei, Fergus, and Torralba 2009). Some of the seminal 
papers on the bag of words (bag of keypoints) approach to whole-image category recognition 
have been written by Csurka, Dance, Fan et al. (2004), Lazebnik, Schmid, and Ponce (2006), 
Csurka, Dance, Perronnin et al. (2006), Grauman and Darrell (2007b), and Zhang, Marszalek, 
Lazebnik et al. (2007). Additional and more recent papers in this area include Sivic, Russell,
Efros et al. (2005), Serre, Wolf, and Poggio (2005), Opelt, Pinz, Fussenegger et al. (2006), 
Grauman and Darrell (2007a), Torralba, Murphy, and Freeman (2007), Boiman, Shechtman, 
and Irani (2008), Ferencz, Learned-Miller, and Malik (2008), and Mutch and Lowe (2008).
It is also possible to recognize objects based on their contours, e.g., using shape contexts 
(Belongie, Malik, and Puzicha 2002) or other techniques (Jurie and Schmid 2004; Shotton, 
Blake, and Cipolla 2005; Opelt, Pinz, and Zisserman 2006; Ferrari, Tuytelaars, and Van Gool 
2006a). 

Many object recognition algorithms use part-based decompositions to provide greater in- 
variance to articulation and pose. Early algorithms focused on the relative positions of the 
parts (Fischler and Elschlager 1973; Kanade 1977; Yuille 1991) while newer algorithms use 
more sophisticated models of appearance (Felzenszwalb and Huttenlocher 2005; Fergus, Perona,
and Zisserman 2007; Felzenszwalb, McAllester, and Ramanan 2008). Good overviews
of part-based models for recognition can be found in the course notes of Fergus (2007b, 2009).

Carneiro and Lowe (2006) discuss a number of graphical models used for part-based
recognition, which include trees and stars (Felzenszwalb and Huttenlocher 2005; Fergus, Perona,
and Zisserman 2005; Felzenszwalb, McAllester, and Ramanan 2008), k-fans (Crandall,
Felzenszwalb, and Huttenlocher 2005; Crandall and Huttenlocher 2006), and constellations 
(Burl, Weber, and Perona 1998; Weber, Welling, and Perona 2000; Fergus, Perona, and Zis- 
serman 2007). Other techniques that use part-based recognition include those developed by 
Dorko and Schmid (2003) and Bar-Hillel, Hertz, and Weinshall (2005). 

Combining object recognition with scene segmentation can yield strong benefits. One 
approach is to pre-segment the image into pieces and then match the pieces to portions of 
the model (Mori, Ren, Efros et al. 2004; Mori 2005; He, Zemel, and Ray 2006; Russell, 
Efros, Sivic et al. 2006; Borenstein and Ullman 2008; Csurka and Perronnin 2008; Gu, Lim,
Arbelaez et al. 2009). Another is to vote for potential object locations and scales based on
object detection (Leibe, Leonardis, and Schiele 2008). One of the most popular current
approaches is to use conditional random fields (Kumar and Hebert 2006; He, Zemel, and
Carreira-Perpinan 2004; He, Zemel, and Ray 2006; Levin and Weiss 2006; Winn and Shotton
2006; Hoiem, Rother, and Winn 2007; Rabinovich, Vedaldi, Galleguillos et al. 2007; Verbeek
and Triggs 2007; Yang, Meer, and Foran 2007; Batra, Sukthankar, and Chen 2008; Larlus
and Jurie 2008; He and Zemel 2008; Shotton, Winn, Rother et al. 2009; Kumar, Torr, and
Zisserman 2010), which produce some of the best results on the difficult PASCAL VOC segmentation
challenge (Shotton, Johnson, and Cipolla 2008; Kohli, Ladicky, and Torr 2009).

More and more recognition algorithms are starting to use scene context as part of their 
recognition strategy. Representative papers in this area include those by Torralba (2003), 
Torralba, Murphy, Freeman et al. (2003), Murphy, Torralba, and Freeman (2003), Torralba, 
Murphy, and Freeman (2004), Crandall and Huttenlocher (2007), Rabinovich, Vedaldi, Galleguillos
et al. (2007), Russell, Torralba, Liu et al. (2007), Hoiem, Efros, and Hebert (2008a),
Hoiem, Efros, and Hebert (2008b), Sudderth, Torralba, Freeman et al. (2008), and Divvala, 
Hoiem, Hays et al. (2009). 

Sophisticated machine learning techniques are also becoming a key component of suc- 
cessful object detection and recognition algorithms (Varma and Ray 2007; Felzenszwalb, 
McAllester, and Ramanan 2008; Fritz and Schiele 2008; Sivic, Russell, Zisserman et al.
2008; Vedaldi, Gulshan, Varma et al. 2009), as is exploiting large human-labeled databases 
(Russell, Torralba, Liu et al. 2007; Malisiewicz and Efros 2008; Torralba, Freeman, and Fer- 
gus 2008; Liu, Yuen, and Torralba 2009). Rough three-dimensional models are also making 
a comeback for recognition, as evidenced in some recent papers (Savarese and Fei-Fei 2007, 
2008; Sun, Su, Savarese et al. 2009; Su, Sun, Fei-Fei et al. 2009). As always, the latest con- 
ferences on computer vision are your best reference for the newest algorithms in this rapidly 
evolving field. 


14.8 Exercises 

Ex 14.1: Face detection Build and test one of the face detectors presented in Section 14.1.1. 

1. Download one or more of the labeled face detection databases in Table 14.2. 

2. Generate your own negative examples by finding photographs that do not contain any 
people. 

3. Implement one of the following face detectors (or devise one of your own): 

• boosting (Algorithm 14.1) based on simple area features, with an optional cascade
of detectors (Viola and Jones 2004); 

• PCA face subspace (Moghaddam and Pentland 1997); 

• distances to clustered face and non-face prototypes, followed by a neural network 
(Sung and Poggio 1998) or SVM (Osuna, Freund, and Girosi 1997) classifier; 

• a multi-resolution neural network trained directly on normalized gray-level patches
(Rowley, Baluja, and Kanade 1998a). 

4. Test the performance of your detector on the database by evaluating the detector at ev- 
ery location in a sub-octave pyramid. Optionally retrain your detector on false positive 
examples you get on non-face images. 

Ex 14.2: Determining the threshold for AdaBoost Given a set of function evaluations on
the training examples $x_i$, $f_i = f(x_i) \in \pm 1$, training labels $y_i \in \pm 1$, and weights $w_i \in (0, 1)$,
as explained in Algorithm 14.1, devise an efficient algorithm to find values of $\theta$ and $s = \pm 1$
that maximize

$$\sum_i w_i y_i h(s f_i, \theta),$$

where $h(x, \theta) = \mathrm{sign}(x - \theta)$.
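
When developing a fast solution, it can help to have a slow reference to test against. The sketch below is such a brute-force baseline and not the efficient algorithm the exercise asks for. It assumes that the quantity being maximized is the weighted agreement, i.e., the sum over i of w_i y_i h(s f_i, θ), that the f_i may be treated as real-valued feature responses, and that values exactly at the threshold count as +1. The function names and the sample data are hypothetical.

```python
def sign(v):
    """Sign convention assumed here: values exactly at the threshold count as +1."""
    return 1 if v >= 0 else -1

def weighted_agreement(f, y, w, theta, s):
    """Evaluate sum_i w_i * y_i * sign(s * f_i - theta)."""
    return sum(wi * yi * sign(s * fi - theta) for fi, yi, wi in zip(f, y, w))

def brute_force_threshold(f, y, w):
    """Try both polarities and every threshold between consecutive sorted values."""
    best = None
    for s in (+1, -1):
        vals = sorted(set(s * fi for fi in f))
        # Candidates: below all values, between neighbors, above all values.
        cands = [vals[0] - 1.0]
        cands += [(a + b) / 2.0 for a, b in zip(vals, vals[1:])]
        cands += [vals[-1] + 1.0]
        for theta in cands:
            score = weighted_agreement(f, y, w, theta, s)
            if best is None or score > best[0]:
                best = (score, theta, s)
    return best

# Hypothetical feature responses, labels, and weights.
f = [0.2, 0.9, 0.4, 0.7]
y = [-1, +1, -1, +1]
w = [0.25, 0.25, 0.25, 0.25]
best_score, best_theta, best_s = brute_force_threshold(f, y, w)
print(best_score, best_theta, best_s)
```

Each candidate threshold is scored with a full pass over the data, so the cost grows quadratically with the number of examples; a faster solution can be checked against this routine on random inputs.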

Ex 14.3: Face recognition using eigenfaces Collect a set of facial photographs and then 
build a recognition system to re-recognize the same people. 

1. Take several photos of each of your classmates and store them. 

2. Align the images by automatically or manually detecting the corners of the eyes and 
using a similarity transform to stretch and rotate each image to a canonical position. 

3. Compute the average image and a PCA subspace for the face images.

4. Take a new set of photographs a week later and use them as your test set. 

5. Compare each new image to each database image and select the nearest one as the 
recognized identity. Verify that the distance in PCA space is close to the distance 
computed with a full SSD (sum of squared difference) measure. 

6. (Optional) Compute different principal components for identity and expression, and 
use them to improve your recognition results. 
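
The check requested in step 5 rests on a standard property of orthonormal projections, sketched below. The notation is introduced here for illustration only: x and y are two aligned face images written as vectors, m is the average image, u_1, ..., u_M are the orthonormal principal components that are retained, and r is the part of the difference x - y that lies outside the subspace.

```latex
% PCA coefficients of the two images
a_i = u_i^T (x - m), \qquad b_i = u_i^T (y - m)

% The SSD splits into an in-subspace term and an orthogonal residual
\|x - y\|^2 = \sum_{i=1}^{M} (a_i - b_i)^2 + \|r\|^2,
\qquad r = (x - y) - \sum_{i=1}^{M} (a_i - b_i)\, u_i

% Hence the PCA-space distance never exceeds the full SSD
\sum_{i=1}^{M} (a_i - b_i)^2 \le \|x - y\|^2
```

The two distances are therefore close whenever the residual is small, i.e., when the retained components capture most of the variation between the two faces.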

Ex 14.4: Bayesian face recognition Moghaddam, Jebara, and Pentland (2000) compute 
separate covariance matrices $\Sigma_I$ and $\Sigma_E$ by looking at differences between all pairs of images.
At run time, they select the nearest image to determine the facial identity. Does it make
sense to estimate statistics for all pairs of images and use them for testing the distance to the
nearest exemplar? Discuss whether this is statistically correct.

How is the all-pair intrapersonal covariance matrix $\Sigma_I$ related to the within-class scatter
matrix $S_w$? Does a similar relationship hold between $\Sigma_E$ and $S_b$?

Ex 14.5: Modular eigenfaces Extend your face recognition system to separately match the 
eye, nose, and mouth regions, as shown in Figure 14.18. 

1. After normalizing face images to a canonical scale and location, manually segment out
some of the eye, nose, and face regions. 

2. Build separate detectors for these three (or four) kinds of region, either using a subspace 
(PCA) approach or one of the techniques presented in Section 14.1.1. 

3. For each new image to be recognized, first detect the locations of the facial features. 
4. Then, match the individual features against your database and note the locations of
these features. 

5. Train and test a classifier that uses the individual feature matching IDs as well as (op- 
tionally) the feature locations to perform face recognition. 

Ex 14.6: Recognition-based color balancing Build a system that recognizes the most im- 
portant color areas in common photographs (sky, grass, skin) and color balances the image 
accordingly. Some references and ideas for skin detection are given in Exercise 2.8 and 
by Forsyth and Fleck (1999), Jones and Rehg (2001), Vezhnevets, Sazonov, and Andreeva 
(2003), and Kakumanu, Makrogiannis, and Bourbakis (2007). These may give you ideas 
for how to detect other regions or you can try more sophisticated MRF-based approaches 
(Shotton, Winn, Rother et al. 2009).

Ex 14.7: Pedestrian detection Build and test one of the pedestrian detectors presented in 
Section 14.1.2. 

Ex 14.8: Simple instance recognition Use the feature detection, matching, and alignment 
algorithms you developed in Exercises 4.1-4.4 and 9.2 to find matching images given a query
image or region (Figure 14.26). 

Evaluate several feature detectors, descriptors, and robust geometric verification strate- 
gies, either on your own or by comparing your results with those of classmates. 

Ex 14.9: Large databases and location recognition Extend the previous exercise to larger 
databases using quantized visual words and information retrieval techniques, as described in 
Algorithm 14.2. 

Test your algorithm on a large database, such as the one used by Nister and Stewenius 
(2006) or Philbin, Chum, Sivic et al. (2008), which are listed in Table 14.1. Alternatively, 
use keyword search on the Web or in a photo sharing site (e.g., for a city) to create your own 
database. 
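
One ingredient of such a system, an inverted index over visual words, might be sketched as follows. This is a simplified stand-in rather than Algorithm 14.2 itself: it only counts shared words, whereas a full system would weight the words (e.g., with tf-idf) and verify the top candidates geometrically. The function names and the toy database are assumptions.

```python
from collections import Counter, defaultdict

def build_inverted_index(database):
    """Map each visual word id to the set of image ids that contain it."""
    index = defaultdict(set)
    for image_id, words in database.items():
        for word in words:
            index[word].add(image_id)
    return index

def rank_candidates(index, query_words):
    """Count, for each database image, how many distinct query words it shares."""
    hits = Counter()
    for word in set(query_words):
        for image_id in index.get(word, ()):
            hits[image_id] += 1
    return hits.most_common()

# Hypothetical database: image id -> list of quantized visual word ids.
database = {"a": [1, 5, 7], "b": [2, 5, 9], "c": [1, 7, 8, 9]}
index = build_inverted_index(database)
ranking = rank_candidates(index, [1, 7, 9])
print(ranking)
```

Only images that share at least one word with the query are ever touched, which is what makes this structure attractive as the database grows.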

Ex 14.10: Bag of words Adapt the feature extraction and matching pipeline developed in 
Exercise 14.8 to category (class) recognition, using some of the techniques described in Sec- 
tion 14.4.1. 

1. Download the training and test images from one or more of the databases listed in 
Tables 14.1 and 14.2, e.g., Caltech 101, Caltech 256, or PASCAL VOC. 

2. Extract features from each of the training images, quantize them, and compute the tf-idf 
vectors (bag of words histograms). 
3. As an option, consider not quantizing the features and using pyramid matching (14.40-14.41)
(Grauman and Darrell 2007b) or using a spatial pyramid for greater selectivity
(Lazebnik, Schmid, and Ponce 2006). 

4. Choose a classification algorithm (e.g., nearest neighbor classification or support vector 
machine) and “train” your recognizer, i.e., build up the appropriate data structures (e.g., 
k-d trees) or set the appropriate classifier parameters. 

5. Test your algorithm on the test data set using the same pipeline you developed in steps 
2-4 and compare your results to the best reported results.

6. Explain why your results differ from the previously reported ones and give some ideas 
for how you could improve your system. 

You can find a good synopsis of the best-performing classification algorithms and their ap- 
proaches in the report of the PASCAL Visual Object Classes Challenge found on their Web 
site (http://pascallin.ecs.soton.ac.uk/challenges/VOC/).
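
Step 2 of this exercise can be sketched as follows, assuming the features have already been quantized into visual word ids. The weighting used here is the common term frequency times inverse document frequency, (n_id / n_d) log(N / N_i); the function names, the cosine similarity used for comparison, and the three toy images are assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf_vectors(docs, vocab_size):
    """docs: one list of visual word ids per image. Returns one tf-idf vector per image."""
    num_docs = len(docs)
    # Number of images in which each visual word occurs at least once.
    doc_freq = Counter(word for words in docs for word in set(words))
    vectors = []
    for words in docs:
        vec = [0.0] * vocab_size
        for word, n in Counter(words).items():
            vec[word] = (n / len(words)) * math.log(num_docs / doc_freq[word])
        vectors.append(vec)
    return vectors

def cosine_similarity(u, v):
    """Normalized dot product; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

# Hypothetical quantized features (visual word ids) for three training images.
images = [[0, 0, 1, 3], [0, 1, 1, 3], [2, 2, 3, 3]]
vecs = tf_idf_vectors(images, vocab_size=4)
sim_01 = cosine_similarity(vecs[0], vecs[1])
sim_02 = cosine_similarity(vecs[0], vecs[2])
print(sim_01, sim_02)
```

Note that word 3, which occurs in every image, receives zero weight, so it contributes nothing to the similarity between any pair of images.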

Ex 14.11: Object detection and localization Extend the classification algorithm developed 
in the previous exercise to localize the objects in an image by reporting a bounding box around 
each detected object. The easiest way to do this is to use a sliding window approach. Some 
pointers to recent techniques in this area can be found in the workshop associated with the 
PASCAL VOC 2008 Challenge. 

Ex 14.12: Part-based recognition Choose one or more of the techniques described in Sec- 
tion 14.4.2 and implement a part-based recognition system. Since these techniques are fairly 
involved, you will need to read several of the research papers in this area, select which gen- 
eral approach you want to follow, and then implement your algorithm. A good starting point 
could be the paper by Felzenszwalb, McAllester, and Ramanan (2008), since it performed 
well in the PASCAL VOC 2008 detection challenge. 

Ex 14.13: Recognition and segmentation Choose one or more of the techniques described 
in Section 14.4.3 and implement a simultaneous recognition and segmentation system. Since 
these techniques are fairly involved, you will need to read several of the research papers in this 
area, select which general approach you want to follow, and then implement your algorithm. 
Test your algorithm on one or more of the segmentation databases in Table 14.2. 

Ex 14.14: Context Implement one or more of the context and scene understanding sys- 
tems described in Section 14.5 and report on your experience. Does context or whole scene 
understanding perform better at naming objects than stand-alone systems? 

Ex 14.15: Tiny images Download the tiny images database from 
http://people.csail.mit.edu/torralba/tinyimages/ and build a classifier based on comparing 
your test images directly against all of the labeled training images. Does this seem like a 
promising approach? 


Chapter 15 


Conclusion 


In this book, we have covered a broad range of computer vision topics. Starting with image 
formation, we have seen how images can be pre-processed to remove noise or blur, segmented 
into regions, or converted into feature descriptors. Multiple images can be matched and 
registered, with the results used to estimate motion, track people, reconstruct 3D models, 
or merge images into more attractive and interesting composites and renderings. Images can 
also be analyzed to produce semantic descriptions of their content. However, the gap between 
computer and human performance in this area is still large and is likely to remain so for many 
years. 

Our study has also exposed us to a wide range of mathematical techniques. These include 
continuous mathematics, such as signal processing, variational approaches, three-dimensional 
and projective geometry, linear algebra, and least squares. We have also studied topics in 
discrete mathematics and computer science, such as graph algorithms, combinatorial opti- 
mization, and even database techniques for information retrieval. Since many problems in 
computer vision are inverse problems that involve estimating unknown quantities from noisy 
input data, we have also looked at Bayesian statistical inference techniques, as well as ma- 
chine learning techniques to learn probabilistic models from large amounts of training data. 
As the availability of partially labeled visual imagery on the Internet continues to increase 
exponentially, this latter approach will continue to have a major impact on our field. 

You may ask: why is our field so broad and aren’t there any unifying principles that can 
be used to simplify our study? Part of the answer lies in the expansive definition of com- 
puter vision, which is the analysis of images and video, as well as the incredible complexity 
inherent in the formation of visual imagery. In some ways, our field is as complex as the 
study of automotive engineering, which requires an understanding of internal combustion, 
mechanics, aerodynamics, ergonomics, electrical circuitry, and control systems, among other 
topics. Computer vision similarly draws on a wide variety of sub-disciplines, which makes it 
challenging to cover in a one-semester course, let alone to achieve mastery during a course 
of graduate studies. Conversely, the incredible breadth and technical complexity of computer 
vision problems is what draws many people to this research field. 

Because of this richness and the difficulty in making and measuring progress, I have at- 
tempted to instill in my students and in readers of this book a discipline founded on principles 
from engineering, science, and statistics. 

The engineering approach to problem solving is to first carefully define the overall prob- 
lem being tackled and to question the basic assumptions and goals inherent in this process. 
Once this has been done, a number of alternative solutions or approaches are implemented 
and carefully tested, paying attention to issues such as reliability and computational cost. 
Finally, one or more solutions are deployed and evaluated in real-world settings. For this 
reason, this book contains many different alternatives for solving vision problems, many of 
which are sketched out in the exercises for students to implement and test on their own. 

The scientific approach builds upon a basic understanding of physical principles. In the 
case of computer vision, this includes the physics of man-made and natural structures, image 
formation, including lighting and atmospheric effects, optics, and noisy sensors. The task is to 
then invert this formation using stable and efficient algorithms to obtain reliable descriptions 
of the scene and other quantities of interest. The scientific approach also encourages us to 
formulate and test hypotheses, which is similar to the extensive testing and evaluation inherent 
in engineering disciplines. 

Lastly, because so much about the image formation process is inherently uncertain and 
ambiguous, a statistical approach that models both uncertainty in the world (e.g., the number 
and types of animals in a picture) and noise in the image formation process, is often essential. 
Bayesian inference techniques can then be used to combine prior and measurement models 
to estimate the unknowns and to model their uncertainty. Machine learning techniques can 
be used to create the probabilistic models in the first place. Efficient learning and inference 
algorithms, such as dynamic programming, graph cuts, and belief propagation, often play a 
crucial role in this process. 

Given the breadth of material we have covered in this book, what new developments are 
we likely to see in the future? As I have mentioned before, one of the recent trends in com- 
puter vision is using the massive amounts of partially labeled visual data on the Internet as 
sources for learning visual models of scenes and objects. We have already seen data-driven 
approaches succeed in related fields such as speech recognition, machine translation, speech 
and music synthesis, and even computer graphics (both in image-based rendering and anima- 
tion from motion capture). A similar process has been occurring in computer vision, with 
some of the most exciting new work occurring at the intersection of the object recognition 
and machine learning fields. 

More traditional quantitative techniques in computer vision such as motion estimation, 
stereo correspondence, and image enhancement, all benefit from better prior models for im- 
ages, motions, and disparities, as well as efficient statistical inference techniques such as 
those for inhomogeneous and higher-order Markov random fields. Some techniques, such as 
feature matching and structure from motion, have matured to where they can be applied to 
almost arbitrary collections of images of static scenes. This has resulted in an explosion of 
work in 3D modeling from Internet datasets, which again is related to visual recognition from 
massive amounts of data. 

While these are all encouraging developments, the gap between human and machine per- 
formance in semantic scene understanding remains large. It may be many years before com- 
puters can name and outline all of the objects in a photograph with the same skill as a two- 
year-old child. However, we have to remember that human performance is often the result of 
many years of training and familiarity and often works best in special ecologically important 
situations. For example, while humans appear to be experts at face recognition, our actual 
performance when shown people we do not know well is not that good. Combining vision 
algorithms with general inference techniques that reason about the real world will likely lead 
to more breakthroughs, although some of the problems may turn out to be “AI-complete”, in 
the sense that a full emulation of human experience and intelligence may be necessary. 

Whatever the outcome of these research endeavors, computer vision is already having 
a tremendous impact in many areas, including digital photography, visual effects, medical 
imaging, safety and surveillance, and Web-based search. The breadth of the problems and 
techniques inherent in this field, combined with the richness of the mathematics and the 
utility of the resulting algorithms, will ensure that this remains an exciting area of study for 
years to come. 


In this appendix, we introduce some elements of linear algebra and numerical techniques that 
are used elsewhere in the book. We start with some basic decompositions in matrix algebra, 
including the singular value decomposition (SVD), eigenvalue decompositions, and other 
matrix decompositions (factorizations). Next, we look at the problem of linear least squares, 
which can be solved using either the QR decomposition or normal equations. This is followed 
by non-linear least squares, which arise when the measurement equations are not linear in the 
unknowns or when robust error functions are used. Such problems require iteration to find 
a solution. Next, we look at direct solution (factorization) techniques for sparse problems, 
where the ordering of the variables can have a large influence on the computation and memory 
requirements. Finally, we discuss iterative techniques for solving large linear (or linearized) 
least squares problems. Good general references for much of this material include the work 
by Bjorck (1996), Golub and Van Loan (1996), Trefethen and Bau (1997), Meyer (2000), 
Nocedal and Wright (2006), and Bjorck and Dahlquist (2010). 


A note on vector and matrix indexing. To be consistent with the rest of the book and 
with the general usage in the computer science and computer vision communities, I adopt 
a 0-based indexing scheme for vector and matrix element indexing. Please note that most 
mathematical textbooks and papers use 1-based indexing, so you need to be aware of the 
differences when you read this book. 


Software implementations. Highly optimized and tested libraries corresponding to the al- 
gorithms described in this appendix are readily available and are listed in Appendix C.2. 


A.1 Matrix decompositions 

In order to better understand the structure of matrices and more stably perform operations 
such as inversion and system solving, a number of decompositions (or factorizations) can be 
used. In this section, we review singular value decomposition (SVD), eigenvalue decomposi- 
tion, QR factorization, and Cholesky factorization. 


A.1.1 Singular value decomposition 

One of the most useful decompositions in matrix algebra is the singular value decomposition 
(SVD), which states that any real-valued M x N matrix A can be written as 

$$A_{M \times N} = U_{M \times P} \, \Sigma_{P \times P} \, V^T_{P \times N}, \tag{A.1}$$

where $P = \min(M, N)$. The matrices $U$ and $V$ are orthonormal, i.e., $U^T U = I$ and 
$V^T V = I$, and so are their column vectors, 

$$u_i \cdot u_j = v_i \cdot v_j = \delta_{ij}. \tag{A.2}$$

The singular values are all non-negative and can be ordered in decreasing order 

$$\sigma_0 \geq \sigma_1 \geq \cdots \geq \sigma_{p-1} \geq 0. \tag{A.3}$$

A geometric intuition for the SVD of a matrix $A$ can be obtained by re-writing 
$A = U \Sigma V^T$ in (A.1) as 

$$A V = U \Sigma \quad \text{or} \quad A v_j = \sigma_j u_j. \tag{A.4}$$

This formula says that the matrix $A$ takes any basis vector $v_j$ and maps it to a direction $u_j$ 
with length $\sigma_j$, as shown in Figure A.1. 

If only the first $r$ singular values are positive, the matrix $A$ is of rank $r$ and the index $p$ 
in the SVD (A.1) can be replaced by $r$. (In other words, we can drop the last 
$p - r$ columns of $U$ and $V$.) 

An important property of the singular value decomposition of a matrix (also true for 
the eigenvalue decomposition of a real symmetric non-negative definite matrix) is that if we 
truncate the expansion 

$$A = \sum_{j=0}^{t} \sigma_j u_j v_j^T, \tag{A.5}$$

we obtain the best possible least squares approximation to the original matrix A. This is 
used both in eigenface-based face recognition systems (Section 14.2.1) and in the separable 
approximation of convolution kernels (3.21). 
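
As a small illustration of the truncation in (A.5), the sketch below recovers the leading term (t = 0) of a 3 x 3 smoothing kernel. It is not a library routine: it swaps in plain power iteration on $A^T A$ for a full SVD, and the names `rank1_approx`, `matvec`, and `transpose` are hypothetical helpers introduced only for this example. It assumes the all-ones starting vector is not orthogonal to $v_0$.

```python
import math

def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def rank1_approx(A, iters=100):
    """Leading singular triplet (sigma_0, u_0, v_0) via power iteration on A^T A.

    Assumption: the starting vector of all ones is not orthogonal to v_0.
    """
    At = transpose(A)
    v = [1.0] * len(A[0])
    for _ in range(iters):
        w = matvec(At, matvec(A, v))              # w = A^T A v
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = matvec(A, v)
    sigma = math.sqrt(sum(x * x for x in Av))     # sigma_0 = ||A v_0||
    u = [x / sigma for x in Av]                   # u_0 = A v_0 / sigma_0, cf. (A.4)
    return sigma, u, v

# A separable kernel: the outer product of [1, 2, 1] / 4 with itself.
K = [[1 / 16, 2 / 16, 1 / 16],
     [2 / 16, 4 / 16, 2 / 16],
     [1 / 16, 2 / 16, 1 / 16]]
sigma, u, v = rank1_approx(K)
K1 = [[sigma * ui * vj for vj in v] for ui in u]  # the t = 0 term of (A.5)
```

Because this kernel is exactly separable, the single term `K1` reproduces `K`; for a non-separable kernel it would only be the best rank-1 approximation in the least squares sense.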

A.1.2 Eigenvalue decomposition 

If the matrix $C$ is symmetric (m = n),¹ it can be written as an eigenvalue decomposition, 

$$C = U \Lambda U^T = \begin{bmatrix} u_0 & \cdots & u_{n-1} \end{bmatrix} \begin{bmatrix} \lambda_0 & & \\ & \ddots & \\ & & \lambda_{n-1} \end{bmatrix} \begin{bmatrix} u_0^T \\ \vdots \\ u_{n-1}^T \end{bmatrix} = \sum_{i=0}^{n-1} \lambda_i u_i u_i^T. \tag{A.6}$$

¹ In this appendix, we denote symmetric matrices using $C$ and general rectangular matrices using $A$. 

Figure A.1 The action of a matrix $A$ can be visualized by thinking of the domain as being 
spanned by a set of orthonormal vectors $v_j$, each of which is transformed to a new orthogonal 
vector $u_j$ with a length $\sigma_j$. When $A$ is interpreted as a covariance matrix and its eigenvalue 
decomposition is performed, each of the $u_j$ axes denote a principal direction (component) 
and each $\sigma_j$ denotes one standard deviation along that direction. 

(The eigenvector matrix $U$ is sometimes written as $\Phi$ and the eigenvectors $u$ as $\phi$.) In this 
case, the eigenvalues 

$$\lambda_0 \geq \lambda_1 \geq \cdots \geq \lambda_{n-1} \tag{A.7}$$

can be both positive and negative.² 

A special case of the symmetric matrix $C$ occurs when it is constructed as the sum of a 
number of outer products 

$$C = \sum_i a_i a_i^T = A A^T, \tag{A.8}$$

which often occurs when solving least squares problems (Appendix A.2), where the matrix $A$ 
consists of all the $a_i$ column vectors stacked side-by-side. In this case, we are guaranteed that 
all of the eigenvalues $\lambda_i$ are non-negative. The associated matrix $C$ is positive semi-definite 

$$x^T C x \geq 0, \quad \forall x. \tag{A.9}$$

If the matrix C is of full rank, the eigenvalues are all positive and the matrix is called sym- 
metric positive definite (SPD). 

Symmetric positive semi-definite matrices also arise in the statistical analysis of data, 
since they represent the covariance of a set of $\{x_i\}$ points around their mean $\bar{x}$, 

$$C = \frac{1}{n} \sum_i (x_i - \bar{x})(x_i - \bar{x})^T. \tag{A.10}$$

In this case, performing the eigenvalue decomposition is known as principal component analysis 
(PCA), since it models the principal directions (and magnitudes) of variation of the point 
distribution around their mean, as shown in Section 5.1.1 (5.13-5.15), Section 14.2.1 (14.9), 
and Appendix B.1.1 (B.10). Figure A.1 shows how the principal components of the covariance 
matrix $C$ denote the principal axes $u_j$ of the uncertainty ellipsoid corresponding to this 
point distribution and how the $\sigma_j = \sqrt{\lambda_j}$ denote the standard deviations along each axis. 

² Eigenvalue decompositions can be computed for non-symmetric matrices but the eigenvalues and eigenvectors 
can have complex entries in that case. 

The eigenvalues and eigenvectors of $C$ and the singular values and singular vectors of $A$ 
are closely related. Given 

$$A = U \Sigma V^T, \tag{A.11}$$

we get 

$$C = A A^T = U \Sigma V^T V \Sigma U^T = U \Lambda U^T. \tag{A.12}$$

From this, we see that $\lambda_i = \sigma_i^2$ and that the left singular vectors of $A$ are the 
eigenvectors of $C$. 

This relationship gives us an efficient method for computing the eigenvalue decomposition 
of large matrices that are rank deficient, such as the scatter matrices observed in computing 
eigenfaces (Section 14.2.1). Observe that the covariance matrix $C$ in (14.9) is exactly the 
same as $C$ in (A.8). Note also that the individual difference-from-mean images $a_i = x_i - \bar{x}$ 
are long vectors of length $P$ (the number of pixels in the image), while the total number of 
exemplars $N$ (the number of faces in the training database) is much smaller. Instead of forming 
$C = A A^T$, which is $P \times P$, we form the matrix 

$$\hat{C} = A^T A, \tag{A.13}$$

which is $N \times N$. (This involves taking the dot product between every pair of difference 
images $a_i$ and $a_j$.) The eigenvalues of $\hat{C}$ are the squared singular values of $A$, namely $\Sigma^2$, 
and are hence also the non-zero eigenvalues of $C$. The eigenvectors of $\hat{C}$ are the right singular vectors 
$V$ of $A$, from which the desired eigenfaces $U$, which are the left singular vectors of $A$, can 
be computed as 

$$U = A V \Sigma^{-1}. \tag{A.14}$$

This final step is essentially computing the eigenfaces as linear combinations of the difference 
images (Turk and Pentland 1991a). If you have access to a high-quality linear algebra package 
such as LAPACK, routines for efficiently computing a small number of the left singular 
vectors and singular values of rectangular matrices such as $A$ are usually provided 
(Appendix C.2). However, if storing all of the images in memory is prohibitive, the construction 
of $\hat{C}$ in (A.13) can be used instead. 
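
The construction in (A.13) and (A.14) can be checked on a toy problem. The sketch below is only an illustration under stated assumptions: it uses N = 2 made-up "images" of P = 4 pixels so that the small N x N eigenproblem can be solved in closed form, and `eig_sym_2x2` and `dot` are hypothetical helpers (the former assumes a non-zero off-diagonal entry).

```python
import math

def dot(p, q):
    return sum(x * y for x, y in zip(p, q))

def eig_sym_2x2(a, b, d):
    """Eigenvalues (descending) and unit eigenvectors of [[a, b], [b, d]].

    Assumption: b != 0, so (b, lam - a) is a valid non-zero eigenvector.
    """
    mean, diff = (a + d) / 2.0, (a - d) / 2.0
    root = math.hypot(diff, b)
    vals = [mean + root, mean - root]
    vecs = []
    for lam in vals:
        n = math.hypot(b, lam - a)
        vecs.append((b / n, (lam - a) / n))
    return vals, vecs

# Two difference-from-mean "images" (made-up data), the columns a_0, a_1 of A.
a0 = [1.0, 2.0, 0.0, -1.0]
a1 = [0.0, 1.0, 3.0, 1.0]

# Small N x N matrix of pairwise dot products, C_hat = A^T A (A.13).
vals, vecs = eig_sym_2x2(dot(a0, a0), dot(a0, a1), dot(a1, a1))

# Eigenfaces U = A V Sigma^{-1} (A.14): linear combinations of a_0 and a_1.
eigenfaces = []
for lam, (v0, v1) in zip(vals, vecs):
    sigma = math.sqrt(lam)
    eigenfaces.append([(v0 * x + v1 * y) / sigma for x, y in zip(a0, a1)])
```

Each vector in `eigenfaces` has unit length and is an eigenvector of the P x P matrix $A A^T$, even though that larger matrix is never formed.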

How can eigenvalue and singular value decompositions actually be computed? Notice 
that an eigenvector is defined by the equation 

$$\lambda_i u_i = C u_i \quad \text{or} \quad (\lambda_i I - C) u_i = 0. \tag{A.15}$$




Computer Vision: Algorithms and Applications (September 3, 2010 draft) 


(This can be derived from (A.6) by post-multiplying both sides by $u_i$.) Since the latter equation 
is homogeneous, i.e., it has a zero right-hand-side, it can only have a non-zero (non-trivial) 
solution for $u_i$ if the system is rank deficient, i.e., 

$$|(\lambda I - C)| = 0. \tag{A.16}$$

Evaluating this determinant yields a characteristic polynomial equation in $\lambda$, which can be 
solved for small problems, e.g., $2 \times 2$ or $3 \times 3$ matrices, in closed form. 

For larger matrices, iterative algorithms that first reduce the matrix C to a real symmetric 
tridiagonal form using orthogonal transforms and then perform QR iterations are normally 
used (Golub and Van Loan 1996; Trefethen and Bau 1997; Bjorck and Dahlquist 2010). Since 
these techniques are rather involved, it is best to use a linear algebra package such as LAPACK 
(Anderson, Bai, Bischof et al. 1999); see Appendix C.2. 

Factorization with missing data requires different kinds of iterative algorithms, which of- 
ten involve either hallucinating the missing terms or minimizing some weighted reconstruc- 
tion metric, which is intrinsically much more challenging than regular factorization. This 
area has been widely studied in computer vision (Shum, Ikeuchi, and Reddy 1995; De la 
Torre and Black 2003; Huynh, Hartley, and Heyden 2003; Buchanan and Fitzgibbon 2005; 
Gross, Matthews, and Baker 2006; Torresani, Hertzmann, and Bregler 2008) and is sometimes 
called generalized PCA. However, this term is also sometimes used to denote algebraic 
subspace clustering techniques, which is the subject of a forthcoming monograph by Vidal, 
Ma, and Sastry (2010). 

A.1.3 QR factorization 

A widely used technique for stably solving poorly conditioned least squares problems (Bjorck 
1996) and as the basis of more complex algorithms, such as computing the SVD and eigen- 
value decompositions, is the QR factorization, 

$$A = Q R, \tag{A.17}$$

where $Q$ is an orthonormal (or unitary) matrix, $Q Q^T = I$, and $R$ is upper triangular.³ In 
computer vision, QR can be used to convert a camera matrix into a rotation matrix and 
an upper-triangular calibration matrix (6.35) and also in various self-calibration algorithms 
(Section 7.2.2). The most common algorithms for computing QR decompositions, modified 
Gram-Schmidt, Householder transformations, and Givens rotations, are described by Golub 
and Van Loan (1996), Trefethen and Bau (1997), and Bjorck and Dahlquist (2010) and are 

³ The term “R” comes from the German name for the lower-upper (LU) decomposition, which is LR for “links” 
and “rechts” (left and right of the diagonal). 

procedure Cholesky(C, R):

    R = C
    for i = 0 ... n - 1
        for j = i + 1 ... n - 1
            R_{j,j:n-1} = R_{j,j:n-1} - r_{ij} r_{ii}^{-1} R_{i,j:n-1}
        R_{i,i:n-1} = r_{ii}^{-1/2} R_{i,i:n-1}

Algorithm A.1 Cholesky decomposition of the matrix C into its upper triangular form R. 


also found in LAPACK. Unlike the SVD and eigenvalue decompositions, QR factorization 
does not require iteration and can be computed exactly in $O(MN^2 + N^3)$ operations, where 
$M$ is the number of rows and $N$ is the number of columns (for a tall matrix). 


A.1.4 Cholesky factorization 

Cholesky factorization can be applied to any symmetric positive definite matrix $C$ to convert 
it into a symmetric product of lower and upper triangular matrices, 

$$C = L L^T = R^T R, \tag{A.18}$$

where $L$ is a lower-triangular matrix and $R = L^T$ is an upper-triangular matrix. Unlike Gaussian 
elimination, which may require pivoting (row and column reordering) or may become unstable 
(sensitive to roundoff errors or reordering), Cholesky factorization remains stable for 
positive definite matrices, such as those that arise from normal equations in least squares 
problems (Appendix A.2). Because of the form of (A.18), the matrices $L$ and $R$ are sometimes 
called matrix square roots.⁴ 

The algorithm to compute an upper triangular Cholesky decomposition of $C$ is a straightforward 
symmetric generalization of Gaussian elimination and is based on the decomposition 
(Bjorck 1996; Golub and Van Loan 1996) 

$$C = \begin{bmatrix} \gamma & c^T \\ c & C_{11} \end{bmatrix} \tag{A.19}$$

$$= \begin{bmatrix} \gamma^{1/2} & 0^T \\ c \gamma^{-1/2} & I \end{bmatrix} \begin{bmatrix} 1 & 0^T \\ 0 & C_{11} - c \gamma^{-1} c^T \end{bmatrix} \begin{bmatrix} \gamma^{1/2} & \gamma^{-1/2} c^T \\ 0 & I \end{bmatrix} \tag{A.20}$$

$$= R_0^T C_1 R_0, \tag{A.21}$$

which, through recursion, can be turned into 

$$C = R_0^T \cdots R_{n-1}^T R_{n-1} \cdots R_0 = R^T R. \tag{A.22}$$

⁴ In fact, there exists a whole family of matrix square roots. Any matrix of the form $LQ$ or $QR$, where $Q$ is a 
unitary matrix, is a square root of $C$. 

Algorithm A.1 provides a more procedural definition, which can store the upper-triangular 
matrix $R$ in the same space as $C$, if desired. The total operation count for Cholesky factorization 
is $O(N^3)$ for a dense matrix but can be significantly lower for sparse matrices with 
low fill-in (Appendix A.4). 
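
The loop structure of Algorithm A.1 can be transcribed into Python roughly as follows. This is a minimal dense sketch for small matrices stored as lists of rows, not a replacement for a tuned library routine; the function name `cholesky_upper` and the explicit positive-definiteness check are additions made for this example.

```python
import math

def cholesky_upper(C):
    """Return upper-triangular R with C = R^T R, following Algorithm A.1.

    Only the upper triangle of C is read; C itself is left untouched.
    """
    n = len(C)
    R = [row[:] for row in C]                  # R = C (work on a copy)
    for i in range(n):
        if R[i][i] <= 0.0:
            raise ValueError("matrix is not positive definite")
        for j in range(i + 1, n):
            f = R[i][j] / R[i][i]              # r_ij / r_ii
            for k in range(j, n):
                R[j][k] -= f * R[i][k]         # R[j, j:] -= r_ij r_ii^-1 R[i, j:]
        s = 1.0 / math.sqrt(R[i][i])
        for k in range(i, n):
            R[i][k] *= s                       # R[i, i:] = r_ii^-1/2 R[i, i:]
    for i in range(n):                         # clear the unused lower triangle
        for k in range(i):
            R[i][k] = 0.0
    return R

C = [[4.0, 2.0, 2.0],
     [2.0, 10.0, 7.0],
     [2.0, 7.0, 21.0]]
R = cholesky_upper(C)
```

For this input the factor is `[[2, 1, 1], [0, 3, 2], [0, 0, 4]]`, which can be verified by forming $R^T R$.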

Note that Cholesky decomposition can also be applied to block-structured matrices, where 
the term $\gamma$ in (A.19) is now a square block sub-matrix and $c$ is a rectangular matrix (Golub 
and Van Loan 1996). The computation of square roots can be avoided by leaving the $\gamma$ on 
the diagonal of the middle factor in (A.20), which results in the $C = L D L^T$ factorization, 
where $D$ is a diagonal matrix. However, since square roots are relatively fast on modern 
computers, this is not worth the bother and Cholesky factorization is usually preferred. 

A.2 Linear least squares 

Least squares fitting problems are pervasive in computer vision. For example, the alignment 
of images based on matching feature points involves the minimization of a squared distance 
objective function (6.2), 

$$E_{\mathrm{LS}} = \sum_i \|r_i\|^2 = \sum_i \|f(x_i; p) - x'_i\|^2, \tag{A.23}$$

where 

$$r_i = x'_i - f(x_i; p) = \hat{x}'_i - \tilde{x}'_i \tag{A.24}$$

is the residual between the measured location $\hat{x}'_i$ and its corresponding current predicted 
location $\tilde{x}'_i = f(x_i; p)$. More complex versions of least squares problems, such as large-scale 
structure from motion (Section 7.4), may involve the minimization of functions of thousands 
of variables. Even problems such as image filtering (Section 3.4.3) and regularization (Sec- 
tion 3.7.1) may involve the minimization of sums of squared errors. 

Figure A.2a shows an example of a simple least squares line fitting problem, where the 
quantities being estimated are the line equation parameters $(m, b)$. When the sampled vertical 
values $y_i$ are assumed to be noisy versions of points on the line $y = mx + b$, the optimal 
estimates for $(m, b)$ can be found by minimizing the squared vertical residuals 

$$E_{\mathrm{VLS}} = \sum_i |y_i - (m x_i + b)|^2. \tag{A.25}$$

Note that the function being fitted need not itself be linear to use linear least squares. All that 
is required is that the function be linear in the unknown parameters. For example, polynomial 
fitting can be written as 

$$E_{\mathrm{PLS}} = \sum_i \Big| y_i - \Big( \sum_{j=0}^{p} a_j x_i^j \Big) \Big|^2, \tag{A.26}$$

while sinusoid fitting with unknown amplitude $A$ and phase $\phi$ (but known frequency $f$) can 
be written as 

$$E_{\mathrm{SLS}} = \sum_i |y_i - A \sin(2\pi f x_i + \phi)|^2 = \sum_i |y_i - (B \sin 2\pi f x_i + C \cos 2\pi f x_i)|^2, \tag{A.27}$$

which is linear in $(B, C)$. 
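
To make the reparameterization in (A.27) concrete, the sketch below fits $(B, C)$ by solving the 2 x 2 normal equations directly and then converts back using $B = A \cos\phi$ and $C = A \sin\phi$. The function name `fit_sinusoid` and the synthetic, noise-free data are assumptions made for illustration only.

```python
import math

def fit_sinusoid(xs, ys, freq):
    """Least squares fit of y = B sin(2 pi f x) + C cos(2 pi f x).

    Returns (amplitude, phase) of the equivalent A sin(2 pi f x + phi).
    """
    s = [math.sin(2 * math.pi * freq * x) for x in xs]
    c = [math.cos(2 * math.pi * freq * x) for x in xs]
    # Entries of the 2x2 normal equations for the unknowns (B, C).
    ss = sum(v * v for v in s)
    sc = sum(u * v for u, v in zip(s, c))
    cc = sum(v * v for v in c)
    sy = sum(u * y for u, y in zip(s, ys))
    cy = sum(u * y for u, y in zip(c, ys))
    det = ss * cc - sc * sc
    B = (cc * sy - sc * cy) / det
    C = (ss * cy - sc * sy) / det
    # B = A cos(phi), C = A sin(phi)  =>  A = hypot(B, C), phi = atan2(C, B).
    return math.hypot(B, C), math.atan2(C, B)

# Synthetic samples with amplitude 1.5, phase 0.4, and known frequency 0.7.
xs = [0.05 * i for i in range(40)]
ys = [1.5 * math.sin(2 * math.pi * 0.7 * x + 0.4) for x in xs]
amplitude, phase = fit_sinusoid(xs, ys, freq=0.7)
```

With noise-free samples the original amplitude and phase are recovered to within roundoff; with noisy samples the same code returns the least squares estimate.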

In general, it is more common to denote the unknown parameters using $x$ and to write the 
general form of linear least squares as⁵ 

$$E_{\mathrm{LLS}} = \sum_i |a_i x - b_i|^2 = \|A x - b\|^2. \tag{A.28}$$

Expanding the above equation gives us 

$$E_{\mathrm{LLS}} = x^T (A^T A) x - 2 x^T (A^T b) + \|b\|^2, \tag{A.29}$$

whose minimum value for $x$ can be found by solving the associated normal equations (Bjorck 
1996; Golub and Van Loan 1996) 

$$(A^T A) x = A^T b. \tag{A.30}$$

The preferred way to solve the normal equations is to use Cholesky factorization. Let 

$$C = A^T A = R^T R, \tag{A.31}$$

where $R$ is the upper-triangular Cholesky factor of the Hessian $C$, and 

$$d = A^T b. \tag{A.32}$$

After factorization, the solution for $x$ can be obtained as 

$$R^T z = d, \quad R x = z, \tag{A.33}$$

which involves the solution of two triangular systems, i.e., forward and backward substitution 
(Bjorck 1996). 
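
The sequence (A.31) to (A.33) can be sketched for a small dense problem as follows. This is an illustrative implementation, not a library routine: the name `solve_normal_equations` and the four data points are made up, and the Cholesky factor is computed here in its row-by-row inner-product form rather than with the in-place updates of Algorithm A.1.

```python
import math

def solve_normal_equations(A, b):
    """Solve (A^T A) x = A^T b via C = R^T R, R^T z = d, R x = z."""
    n = len(A[0])
    C = [[sum(row[i] * row[j] for row in A) for j in range(n)] for i in range(n)]
    d = [sum(row[i] * bi for row, bi in zip(A, b)) for i in range(n)]
    # Upper-triangular Cholesky factor R of C, one row at a time.
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        R[i][i] = math.sqrt(C[i][i] - sum(R[k][i] ** 2 for k in range(i)))
        for j in range(i + 1, n):
            R[i][j] = (C[i][j] - sum(R[k][i] * R[k][j] for k in range(i))) / R[i][i]
    # Forward substitution: R^T z = d (R^T is lower triangular).
    z = [0.0] * n
    for i in range(n):
        z[i] = (d[i] - sum(R[k][i] * z[k] for k in range(i))) / R[i][i]
    # Backward substitution: R x = z.
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(R[i][k] * x[k] for k in range(i + 1, n))) / R[i][i]
    return x

# Line fit y = m x + b: each row of A is (x_i, 1), the unknowns are (m, b).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]
A = [[x, 1.0] for x in xs]
m, b = solve_normal_equations(A, ys)
```

For these four points the fitted slope and intercept are 1.94 and 1.09, which agrees with the textbook closed-form expressions for simple linear regression.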

⁵ Be extra careful in interpreting the variable names here. In the 2D line-fitting example, $x$ is used to denote the 
horizontal axis, but in the general least squares problem, $x = (m, b)$ denotes the unknown parameter vector. 

Figure A.2 Least squares regression. (a) The line $y = mx + b$ is fit to the four noisy data 
points, $\{(x_i, y_i)\}$, denoted by × by minimizing the squared vertical residuals between the 
data points and the line, $\sum_i \|y_i - (m x_i + b)\|^2$. (b) When the measurements $\{(x_i, y_i)\}$ are 
assumed to have noise in all directions, the sum of orthogonal squared distances to the line 
$\sum_i \|a x_i + b y_i + c\|^2$ is minimized using total least squares. 


In cases where the least squares problem is numerically poorly conditioned (which should 
generally be avoided by adding sufficient regularization or prior knowledge about the parameters 
(Appendix A.3)), it is possible to use QR factorization or SVD directly on the matrix 
$A$ (Bjorck 1996; Golub and Van Loan 1996; Trefethen and Bau 1997; Nocedal and Wright 
2006; Bjorck and Dahlquist 2010), e.g., 

$$A x = Q R x = b \quad \longrightarrow \quad R x = Q^T b. \tag{A.34}$$

Note that the upper triangular matrices $R$ produced by the Cholesky factorization of 
$C = A^T A$ and the QR factorization of $A$ are the same (up to the signs of the rows of $R$), and that 
solving (A.34) is generally more 
stable (less sensitive to roundoff error) but slower (by a constant factor). 
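
A sketch of the QR route in (A.34) is given below. It uses modified Gram-Schmidt, one of the methods named in Section A.1.3, chosen here only because it is short; production code would normally call a library routine. The names `qr_mgs` and `lstsq_qr` and the sample data are hypothetical, and the matrix is assumed to be tall with full column rank.

```python
import math

def qr_mgs(A):
    """Thin QR of a tall, full-rank matrix by modified Gram-Schmidt.

    Returns (Q, R) where Q is a list of orthonormal columns and R is upper triangular.
    """
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]
    R = [[0.0] * n for _ in range(n)]
    Q = []
    for j in range(n):
        v = cols[j][:]
        for k, q in enumerate(Q):
            R[k][j] = sum(a * b for a, b in zip(q, v))   # project the updated v
            v = [a - R[k][j] * b for a, b in zip(v, q)]
        R[j][j] = math.sqrt(sum(a * a for a in v))
        Q.append([a / R[j][j] for a in v])
    return Q, R

def lstsq_qr(A, b):
    """Minimize ||A x - b|| by solving R x = Q^T b with back substitution."""
    Q, R = qr_mgs(A)
    n = len(R)
    rhs = [sum(qi * bi for qi, bi in zip(q, b)) for q in Q]   # Q^T b
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (rhs[i] - sum(R[i][k] * x[k] for k in range(i + 1, n))) / R[i][i]
    return x

# Same style of line fit as in (A.25): rows (x_i, 1), unknowns (m, b).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]
m, b = lstsq_qr([[x, 1.0] for x in xs], ys)
```

Note that the normal-equations matrix $A^T A$ is never formed, which is where the improved conditioning of this approach comes from.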

A.2.1 Total least squares 

In some problems, e.g., when performing geometric line fitting in 2D images or 3D plane 
fitting to point cloud data, instead of having measurement error along one particular axis, the 
measured points have uncertainty in all directions, which is known as the errors-in-variables 
model (Van Huffel and Lemmerling 2002; Matei and Meer 2006). In this case, it makes more 
sense to minimize a set of homogeneous squared errors of the form 

$$E_{\mathrm{TLS}} = \sum_i (a_i x)^2 = \|A x\|^2, \tag{A.35}$$

which is known as total least squares (TLS) (Van Huffel and Vandewalle 1991; Bjorck 1996; 
Golub and Van Loan 1996; Van Huffel and Lemmerling 2002). 

The above error metric has a trivial minimum solution at $x = 0$ and is, in fact, homogeneous 
in $x$. For this reason, we augment this minimization problem with the requirement that 
$\|x\|^2 = 1$, which results in the eigenvalue problem 

$$x = \arg\min_x x^T (A^T A) x \quad \text{such that} \quad \|x\|^2 = 1. \tag{A.36}$$

The value of $x$ that minimizes this constrained problem is the eigenvector associated with the 
smallest eigenvalue of $A^T A$. This is the same as the last right singular vector of $A$, since 

$$A = U \Sigma V^T, \tag{A.37}$$

$$A^T A = V \Sigma^2 V^T, \tag{A.38}$$

$$A^T A v_k = \sigma_k^2 v_k, \tag{A.39}$$

so that $v_k^T (A^T A) v_k = \sigma_k^2$, which is minimized by selecting the smallest $\sigma_k$ value. 

Figure A.2b shows a line fitting problem where, in this case, the measurement errors are 
assumed to be isotropic in $(x, y)$. The solution for the best line equation $ax + by + c = 0$ is 
found by minimizing 

$$E_{\mathrm{TLS-2D}} = \sum_i (a x_i + b y_i + c)^2, \tag{A.40}$$

i.e., finding the eigenvector associated with the smallest eigenvalue of⁶ 

$$C = A^T A = \sum_i \begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix} \begin{bmatrix} x_i & y_i & 1 \end{bmatrix}. \tag{A.41}$$


Notice, however, that minimizing $\sum_i (a_i x)^2$ in (A.35) is only statistically optimal 
(Appendix B.1.1) if all of the measured terms in the $a_i$, e.g., the $(x_i, y_i, 1)$ measurements, have 
equal noise. This is definitely not the case in the line-fitting example of Figure A.2b (A.40), 
since the 1 values are noise-free. To mitigate this, we first subtract the mean $x$ and $y$ values 
from all the measured points 

$$\hat{x}_i = x_i - \bar{x} \tag{A.42}$$

$$\hat{y}_i = y_i - \bar{y} \tag{A.43}$$

and then fit the 2D line equation $a(x - \bar{x}) + b(y - \bar{y}) = 0$ by minimizing 

$$E_{\mathrm{TLS-2Dm}} = \sum_i (a \hat{x}_i + b \hat{y}_i)^2. \tag{A.44}$$

⁶ Again, be careful with the variable names here. The measurement equation is $a_i = (x_i, y_i, 1)$ and the unknown 
parameters are $x = (a, b, c)$. 

The more general case where each individual measurement component can have a different 
noise level, as is the case in estimating essential and fundamental matrices (Section 7.2), is 
called the heteroscedastic errors-in-variables (HEIV) model and is discussed by Matei and 
Meer (2006). 
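
The mean-subtracted line fit of (A.42) to (A.44) reduces to a 2 x 2 symmetric eigenproblem, which the sketch below solves in closed form. The function name `fit_line_tls` and the test points are assumptions for illustration; the code also assumes the points are not isotropically scattered, since the line normal is then undefined.

```python
import math

def fit_line_tls(points):
    """Fit a x + b y + c = 0 with a^2 + b^2 = 1 by minimizing (A.44).

    Assumption: the 2x2 scatter matrix has distinct eigenvalues.
    """
    n = len(points)
    xm = sum(x for x, _ in points) / n
    ym = sum(y for _, y in points) / n
    sxx = sum((x - xm) ** 2 for x, _ in points)
    sxy = sum((x - xm) * (y - ym) for x, y in points)
    syy = sum((y - ym) ** 2 for _, y in points)
    # Smallest eigenvalue of the scatter matrix [[sxx, sxy], [sxy, syy]].
    lam = (sxx + syy) / 2.0 - math.hypot((sxx - syy) / 2.0, sxy)
    # Its eigenvector is the line normal (a, b); use the better-conditioned row.
    if abs(sxx - lam) >= abs(syy - lam):
        a, b = -sxy, sxx - lam
    else:
        a, b = syy - lam, -sxy
    norm = math.hypot(a, b)
    a, b = a / norm, b / norm
    return a, b, -(a * xm + b * ym)        # the line passes through the centroid

sloped = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]   # on y = 2x + 1
vertical = [(2.0, 0.0), (2.0, 1.0), (2.0, 5.0)]             # on x = 2
line1 = fit_line_tls(sloped)
line2 = fit_line_tls(vertical)
```

Unlike the vertical-residual fit of (A.25), this formulation handles vertical lines without any special casing, since the line is represented by its normal rather than by a slope.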

A.3 Non-linear least squares 

In many vision problems, such as structure from motion, the least squares problem formulated 
in (A.23) involves functions $f(x_i; p)$ that are not linear in the unknown parameters $p$. This 
problem is known as non-linear least squares or non-linear regression (Bjorck 1996; Madsen, 
Nielsen, and Tingleff 2004; Nocedal and Wright 2006). It is usually solved by iteratively 
re-linearizing (A.23) around the current estimate of $p$ using the gradient derivative (Jacobian) 
$J = \partial f / \partial p$ and computing an incremental improvement $\Delta p$. 

As shown in Equations (6.13-6.17), this results in 

$$E_{\mathrm{NLS}}(\Delta p) = \sum_i \|f(x_i; p + \Delta p) - x'_i\|^2 \tag{A.45}$$

$$\approx \sum_i \|J(x_i; p) \Delta p - r_i\|^2, \tag{A.46}$$

where the Jacobians $J(x_i; p)$ and residual vectors $r_i$ play the same role in forming the normal 
equations as $a_i$ and $b_i$ in (A.28). 

Because the above approximation only holds near a local minimum or for small values 
of $\Delta p$, the update $p \leftarrow p + \Delta p$ may not always decrease the summed square residual error 
(A.45). One way to mitigate this problem is to take a smaller step, 

$$p \leftarrow p + \alpha \Delta p, \quad 0 < \alpha \leq 1. \tag{A.47}$$

A simple way to determine a reasonable value of $\alpha$ is to start with 1 and successively halve 
the value, which is a simple form of line search (Al-Baali and Fletcher 1986; Bjorck 1996; 
Nocedal and Wright 2006). 
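
The step-halving rule in (A.47) can be illustrated on a one-parameter toy problem. In the sketch below, the model $y = \exp(p x)$, the function name `gauss_newton_1d`, and the starting value are all made up for the example; the residual is taken as measured minus predicted, so that the step solves the scalar version of the normal equations implied by (A.46).

```python
import math

def gauss_newton_1d(xs, ys, p, iters=20):
    """Fit y = exp(p x) for a scalar p by Gauss-Newton with step halving."""
    def energy(q):
        return sum((math.exp(q * x) - y) ** 2 for x, y in zip(xs, ys))
    for _ in range(iters):
        J = [x * math.exp(p * x) for x in xs]                 # df/dp per sample
        r = [y - math.exp(p * x) for x, y in zip(xs, ys)]     # measured - predicted
        delta = sum(j * ri for j, ri in zip(J, r)) / sum(j * j for j in J)
        alpha, e0 = 1.0, energy(p)
        # Start with a full step and halve it until the error stops increasing.
        while energy(p + alpha * delta) > e0 and alpha > 1e-10:
            alpha *= 0.5
        p += alpha * delta
    return p

xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [math.exp(0.8 * x) for x in xs]
p_hat = gauss_newton_1d(xs, ys, p=-2.0)
```

From this deliberately poor starting value the full Gauss-Newton step overshoots badly, so the first iterations rely on the halving loop; once the estimate is close to the minimum, full steps are accepted.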

Another approach to ensuring a downhill step in error is to add a diagonal damping term
to the approximate Hessian

$$C = \sum_i J^T(x_i)\, J(x_i), \tag{A.48}$$

i.e., to solve

$$[C + \lambda\, \mathrm{diag}(C)]\, \Delta p = d, \tag{A.49}$$

where

$$d = \sum_i J^T(x_i)\, r_i, \tag{A.50}$$

which is called a damped Gauss-Newton method. The damping parameter $\lambda$ is increased if
the squared residual is not decreasing as fast as expected, i.e., as predicted by (A.46), and
is decreased if the expected decrease is obtained (Madsen, Nielsen, and Tingleff 2004). The
combination of the Gauss-Newton (first-order Taylor series) approximation (A.46) and the adaptive
damping parameter $\lambda$ is commonly known as the Levenberg-Marquardt algorithm (Levenberg
1944; Marquardt 1963) and is an example of more general trust region methods, which
are discussed in more detail in (Bjorck 1996; Conn, Gould, and Toint 2000; Madsen, Nielsen,
and Tingleff 2004; Nocedal and Wright 2006).
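
A minimal sketch of the damped update (A.48-A.50) is given below. It makes several simplifying assumptions that are not part of the text: the damping parameter is adapted with a plain accept/reject rule (multiply or divide by 10) rather than by comparing the actual decrease with the one predicted by (A.46), and the function name, test model, and thresholds are all hypothetical.

```python
import numpy as np

def levenberg_marquardt(f, jac, p0, y, lam=1e-3, n_iters=100, tol=1e-12):
    """Minimize sum_i ||f(p) - y||^2 using the damped normal equations (A.49)."""
    p = np.asarray(p0, dtype=float)
    r = y - f(p)
    err = r @ r
    for _ in range(n_iters):
        J = jac(p)
        C = J.T @ J                       # approximate Hessian (A.48)
        d = J.T @ r                       # right-hand side (A.50)
        dp = np.linalg.solve(C + lam * np.diag(np.diag(C)), d)
        r_new = y - f(p + dp)
        err_new = r_new @ r_new
        if err_new < err:                 # downhill: accept and reduce the damping
            converged = err - err_new < tol
            p, r, err = p + dp, r_new, err_new
            lam = max(lam / 10.0, 1e-12)
            if converged:
                break
        else:                             # uphill: reject and increase the damping
            lam *= 10.0
            if lam > 1e12:
                break
    return p

# Hypothetical test problem: fit y = a * exp(b * t) to noise-free samples.
t = np.linspace(0.0, 1.0, 20)

def model(p):
    return p[0] * np.exp(p[1] * t)

def model_jac(p):
    return np.column_stack([np.exp(p[1] * t), p[0] * t * np.exp(p[1] * t)])

y = model(np.array([2.0, -1.5]))
p_lm = levenberg_marquardt(model, model_jac, [1.0, 0.0], y)
```

With a small damping value the step is close to the Gauss-Newton step, while a large value shortens it and turns it toward a (diagonally scaled) gradient descent direction.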

When the initial solution is far away from its quadratic region of convergence around a
local minimum, large residual methods, e.g., Newton-type methods, which add a second-order
term to the Taylor series expansion in (A.46), may converge faster. Quasi-Newton methods
such as BFGS, which require only gradient evaluations, can also be useful if memory size is
an issue. Such techniques are discussed in textbooks and papers on numerical optimization
(Toint 1987; Bjorck 1996; Conn, Gould, and Toint 2000; Nocedal and Wright 2006).


A.4 Direct sparse matrix techniques 

Many optimization problems in computer vision, such as bundle adjustment (Szeliski and
Kang 1994; Triggs, McLauchlan, Hartley et al. 1999; Hartley and Zisserman 2004; Snavely,
Seitz, and Szeliski 2008b; Agarwal, Snavely, Simon et al. 2009), have Jacobian and (approximate)
Hessian matrices that are extremely sparse (Section 7.4.1). For example, Figure 7.9a
shows the bipartite model typical of structure from motion problems, in which most points
are only observed by a subset of the cameras, which results in the sparsity patterns for the
Jacobian and Hessian shown in Figure 7.9b-c.

Whenever the Hessian matrix is sparse enough, it is more efficient to use sparse Cholesky 
factorization instead of regular Cholesky factorization. In such sparse direct techniques, the 
Hessian matrix C and its associated Cholesky factor R are stored in compressed form, in 
which the amount of storage is proportional to the number of (potentially) non-zero entries 
(Bjorck 1996; Davis 2006). 7 Algorithms for computing the non-zero elements in C and R 
from the sparsity pattern of the Jacobian matrix J are given by Bjorck (1996, Section 6.4), 
and algorithms for computing the numerical Cholesky and QR decompositions (once the 
sparsity pattern has been computed and storage allocated) are discussed by Bjorck (1996, 
Section 6.5). 

7 For example, you can store a list of $(i, j, c_{ij})$ triples. One example of such a scheme is compressed sparse
row (CSR) storage. An alternative storage method called skyline, which stores adjacent vertical spans of non-zero
elements (Bathe 2007), is sometimes used in finite element analysis. Banded systems such as snakes (5.3) can store
just the non-zero band elements (Bjorck 1996, Section 6.2) and can be solved in $O(nb^2)$, where $n$ is the number of
variables and $b$ is the bandwidth.
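
As a rough illustration of the storage schemes mentioned in footnote 7, the following pure-Python sketch converts a list of $(i, j, c_{ij})$ triples into CSR arrays and multiplies the stored matrix by a vector. The function names and the small tridiagonal example are assumptions made for this illustration; production code would normally rely on an existing sparse matrix library.

```python
def triples_to_csr(triples, n_rows):
    """Convert (i, j, c_ij) triples to CSR arrays (values, col_indices, row_ptr)."""
    triples = sorted(triples)             # sort by row, then by column
    values = [c for _, _, c in triples]
    col_indices = [j for _, j, _ in triples]
    row_ptr = [0] * (n_rows + 1)
    for i, _, _ in triples:
        row_ptr[i + 1] += 1               # count the entries in each row
    for i in range(n_rows):
        row_ptr[i + 1] += row_ptr[i]      # cumulative sum gives the row offsets
    return values, col_indices, row_ptr

def csr_matvec(values, col_indices, row_ptr, x):
    """Compute y = C x, touching only the stored non-zero entries."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_indices[k]]
    return y

# The 3 x 3 tridiagonal matrix [[2, -1, 0], [-1, 2, -1], [0, -1, 2]] as triples.
triples = [(0, 0, 2.0), (0, 1, -1.0), (1, 0, -1.0), (1, 1, 2.0),
           (1, 2, -1.0), (2, 1, -1.0), (2, 2, 2.0)]
values, col_indices, row_ptr = triples_to_csr(triples, 3)
y = csr_matvec(values, col_indices, row_ptr, [1.0, 2.0, 3.0])
```

The storage is proportional to the number of non-zero entries plus one offset per row, and the matrix-vector product costs one multiply-add per stored entry.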
A.4.1 Variable reordering

The key to efficiently solving sparse problems using direct (non-iterative) techniques is to 
determine an efficient ordering for the variables, which reduces the amount of fill-in, i.e., the 
number of non-zero entries in R that were zero in the original C matrix. We already saw in 
Section 7.4.1 how storing the more numerous 3D point parameters before the camera param- 
eters and using the Schur complement (7.56) results in a more efficient algorithm. Similarly, 
sorting parameters by time in video-based reconstruction problems usually results in lower 
fill-in. Furthermore, any problem whose adjacency graph (the graph corresponding to the 
sparsity pattern) is a tree can be solved in linear time with an appropriate reordering of the 
variables (putting all the children before their parents). All of these are examples of good 
reordering techniques. 
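
The effect of variable ordering on fill-in can be seen on a tiny example whose adjacency graph is a tree (a star). The sketch below uses dense NumPy matrices purely for illustration, with a hypothetical helper `cholesky_nonzeros`; note that `np.linalg.cholesky` returns the lower-triangular factor $L = R^T$, which has the same number of non-zeros as $R$.

```python
import numpy as np

def cholesky_nonzeros(C, order):
    """Count the non-zeros in the Cholesky factor of C after a symmetric permutation."""
    P = C[np.ix_(order, order)]           # reorder rows and columns together
    L = np.linalg.cholesky(P)             # lower-triangular factor, P = L L^T
    return int(np.sum(np.abs(L) > 1e-12))

# "Arrowhead" matrix: variable 0 (the root) is coupled to every other variable
# (its children), so the adjacency graph is a star, which is a tree.
n = 6
C = 4.0 * np.eye(n)
C[0, 1:] = C[1:, 0] = 1.0
C[0, 0] = float(n)

root_first = cholesky_nonzeros(C, list(range(n)))               # parent before children
children_first = cholesky_nonzeros(C, list(range(1, n)) + [0])  # children before parent
```

Eliminating the root first couples all of its children and makes the factor completely dense (21 non-zeros here), whereas putting the children before their parent produces no fill-in at all (11 non-zeros, the same as in the lower triangle of C).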

In the general case of unstructured data, there are many heuristics available to find good
reorderings (Bjorck 1996; Davis 2006). 8 For general adjacency (sparsity) graphs, minimum
degree orderings generally produce good results. For planar graphs, which often arise on image
or spline grids (Section 8.3), nested dissection, which recursively splits the graph into two
equal halves along a frontier (or boundary) of small size, generally works well. Such domain
decomposition (or multi-frontal) techniques also enable the use of parallel processing, since
independent sub-graphs can be processed in parallel on separate processors (Davis 2008).

The overall set of steps used to perform the direct solution of sparse least squares problems
is summarized in Algorithm A.2, which is a modified version of Algorithm 6.6.1 by Bjorck
(1996, Section 6.6). If a series of related least squares problems is being solved, as is the
case in iterative non-linear least squares (Appendix A.3), steps 1-3 can be performed ahead of
time and reused for each new invocation with different C and d values. When the problem is
block-structured, as is the case in structure from motion where point (structure) variables have
dense 3 × 3 sub-entries in C and cameras have 6 × 6 (or larger) entries, the cost of performing
the reordering computation is small compared to the actual numerical factorization, which
can benefit from block-structured matrix operations (Golub and Van Loan 1996). It is also
possible to apply sparse reordering and multifrontal techniques to QR factorization (Davis
2008), which may be preferable when the least squares problems are poorly conditioned.
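
The numerical part of Algorithm A.2 (steps 4-7) can be sketched as follows. This is a dense stand-in written for illustration only: the symbolic steps 1-3 are skipped because no compressed storage is used, the ordering is supplied by the caller, and `np.linalg.solve` is used where a real implementation would use sparse triangular solves. The function name and the small test system are assumptions.

```python
import numpy as np

def cholesky_solve_with_ordering(C, d, order):
    """Solve C x = d through a Cholesky factorization of the permuted system."""
    order = np.asarray(order)
    C_perm = C[np.ix_(order, order)]      # step 4: permute rows and columns of C
    d_perm = d[order]                     # step 4: permute the right-hand side
    L = np.linalg.cholesky(C_perm)        # step 5: C_perm = L L^T, i.e., R = L^T
    z = np.linalg.solve(L, d_perm)        # step 6: R^T z = d (general solver used here)
    x_perm = np.linalg.solve(L.T, z)      # step 6: R x = z
    x = np.empty_like(x_perm)
    x[order] = x_perm                     # step 7: undo the permutation
    return x

# Small arrowhead system; variable 0 is ordered last to avoid fill-in.
C = np.array([[6.0, 1.0, 1.0, 1.0],
              [1.0, 4.0, 0.0, 0.0],
              [1.0, 0.0, 4.0, 0.0],
              [1.0, 0.0, 0.0, 4.0]])
d = np.array([1.0, 2.0, 3.0, 4.0])
x = cholesky_solve_with_ordering(C, d, [1, 2, 3, 0])
```

When the same sparsity pattern recurs, as in iterative non-linear least squares, the ordering can be computed once and passed to every subsequent call.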


8 Finding the optimal reordering with minimal fill-in is provably NP-hard.

procedure SparseCholeskySolve(C, d):

1. Determine symbolically the structure of C, i.e., the adjacency graph.

2. (Optional) Compute a reordering for the variables, taking into account
any block structure inherent in the problem.

3. Determine the fill-in pattern for R and allocate the compressed storage
for R as well as storage for the permuted right hand side $\hat{d}$.

4. Copy the elements of C and d into R and $\hat{d}$, permuting the values
according to the computed ordering.

5. Perform the numerical factorization of R using Algorithm A.1.

6. Solve the factored system (A.33), i.e.,

$$R^T z = \hat{d}, \quad R x = z.$$

7. Return the solution x, after undoing the permutation.


Algorithm A.2 Sparse least squares using a sparse Cholesky decomposition of the matrix C.


A.5 Iterative techniques

When problems become large, the amount of memory required to store the Hessian matrix
C and its factor R, and the amount of time it takes to compute the factorization, can become
prohibitively large, especially when there are large amounts of fill-in. This is often
the case with image processing problems defined on pixel grids, since, even with the optimal
reordering (nested dissection), the amount of fill can still be large.

A preferable approach to solving such linear systems is to use iterative techniques, which
compute a series of estimates that converge to the final solution, e.g., by taking a series of
downhill steps in an energy function such as (A.29).

A large number of iterative techniques have been developed over the years, including such 
well-known algorithms as successive overrelaxation and multi-grid. These are described in 
specialized textbooks on iterative solution techniques (Axelsson 1996; Saad 2003) as well as 
in more general books on numerical linear algebra and least squares techniques (Bjorck 1996; 
Golub and Van Loan 1996; Trefethen and Bau 1997; Nocedal and Wright 2006; Bjorck and 
Dahlquist 2010). 


A.5.1 Conjugate gradient 

The iterative solution technique that often performs best is conjugate gradient descent, which
takes a series of downhill steps that are conjugate to each other with respect to the C matrix,
i.e., any two descent directions u and v satisfy $u^T C v = 0$. In practice, conjugate gradient
descent outperforms other kinds of gradient descent algorithm because the number of iterations
it needs grows with the square root of the condition number of C instead of with the condition
number itself. 9 Shewchuk (1994) provides a nice introduction to this topic, with clear intuitive
explanations of the reasoning behind the conjugate gradient algorithm and its performance.

Algorithm A.3 describes the conjugate gradient algorithm and its related least squares
counterpart, which can be used when the original set of least squares linear equations is
available in the form of Ax = b (A.28). While it is easy to convince yourself that the two
forms are mathematically equivalent, the least squares form is preferable if rounding errors
start to affect the results because of poor conditioning. It may also be preferable if, due to
the sparsity structure of A, multiplies with the original A matrix are faster or more space
efficient than multiplies with C.

The conjugate gradient algorithm starts by computing the current residual $r_0 = d - C x_0$,
which is the direction of steepest descent of the energy function (A.28). It sets the original
descent direction $p_0 = r_0$. Next, it multiplies the descent direction by the quadratic form
(Hessian) matrix C and combines this with the residual to estimate the optimal step size $\alpha_k$.
The solution vector $x_k$ and the residual vector $r_k$ are then updated using this step size.
(Notice how the least squares variant of the conjugate gradient algorithm splits the multiplication
by the $C = A^T A$ matrix across steps 4 and 8.) Finally, a new search direction is calculated
by first computing a factor $\beta$ as the ratio of current to previous squared residual magnitudes. The
new search direction $p_{k+1}$ is then set to the residual plus $\beta$ times the old search direction $p_k$,
which keeps the directions conjugate with respect to C.

It turns out that conjugate gradient descent can also be directly applied to non-quadratic
energy functions, e.g., those arising from non-linear least squares (Appendix A.3). Instead
of explicitly forming a local quadratic approximation C and then computing residuals $r_k$,
non-linear conjugate gradient descent computes the gradient of the energy function E (A.45)
directly inside each iteration and uses it to set the search direction (Nocedal and Wright 2006).
Since the quadratic approximation to the energy function may not exist or may be inaccurate,
line search is often used to determine the step size $\alpha_k$. Furthermore, to compensate for errors
in finding the true function minimum, alternative formulas for $\beta_{k+1}$ such as Polak-Ribiere,

$$\beta_{k+1} = \frac{\nabla E(x_{k+1})^T [\nabla E(x_{k+1}) - \nabla E(x_k)]}{\|\nabla E(x_k)\|^2} \tag{A.51}$$

are often used (Nocedal and Wright 2006).
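
The following sketch shows one way the non-linear variant with the Polak-Ribiere factor (A.51) might be coded. Several choices here are assumptions rather than anything prescribed by the text: a simple backtracking (Armijo) line search is used for the step size, the factor is clipped at zero (often called "PR+"), the search restarts along the negative gradient whenever the direction is not downhill, and the convex test energy is invented for the example.

```python
import numpy as np

def nonlinear_cg(energy, grad, x0, n_iters=200, tol=1e-10):
    """Minimize energy(x) with non-linear conjugate gradient (Polak-Ribiere)."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    p = -g                                # start along the steepest descent direction
    for _ in range(n_iters):
        if np.linalg.norm(g) < tol:
            break
        if g @ p >= 0:                    # not a descent direction: restart
            p = -g
        # Backtracking line search for the step size (Armijo condition).
        alpha, e0, slope = 1.0, energy(x), g @ p
        while energy(x + alpha * p) > e0 + 1e-4 * alpha * slope and alpha > 1e-16:
            alpha *= 0.5
        x = x + alpha * p
        g_new = grad(x)
        # Polak-Ribiere factor (A.51), clipped at zero.
        beta = max(0.0, g_new @ (g_new - g) / (g @ g))
        p = -g_new + beta * p             # the negative gradient plays the role of r
        g = g_new
    return x

# Hypothetical convex, non-quadratic energy with its minimum at (1, -2).
def energy(x):
    return 0.5 * (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 2.0) ** 2 + 0.25 * (x[0] - 1.0) ** 4

def grad(x):
    return np.array([(x[0] - 1.0) + (x[0] - 1.0) ** 3, 4.0 * (x[1] + 2.0)])

x_min = nonlinear_cg(energy, grad, [3.0, 0.0])
```

When successive gradients are nearly equal, the clipped factor falls back to zero and the method reverts to steepest descent for that step.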


9 The condition number $\kappa(C)$ is the ratio of the largest to the smallest eigenvalue of C. The actual convergence
rate depends on the clustering of the eigenvalues, as discussed in the references cited in this section.
ConjugateGradient(C, d, $x_0$)

1. $r_0 = d - C x_0$

2. $p_0 = r_0$

3. for $k = 0 \ldots$

4. $w_k = C p_k$

5. $\alpha_k = \|r_k\|^2 / (p_k \cdot w_k)$

6. $x_{k+1} = x_k + \alpha_k p_k$

7. $r_{k+1} = r_k - \alpha_k w_k$

8. (no operation)

9. $\beta_{k+1} = \|r_{k+1}\|^2 / \|r_k\|^2$

10. $p_{k+1} = r_{k+1} + \beta_{k+1} p_k$


ConjugateGradientLS(A, b, $x_0$)

1. $q_0 = b - A x_0$, $r_0 = A^T q_0$

2. $p_0 = r_0$

3. for $k = 0 \ldots$

4. $v_k = A p_k$

5. $\alpha_k = \|r_k\|^2 / \|v_k\|^2$

6. $x_{k+1} = x_k + \alpha_k p_k$

7. $q_{k+1} = q_k - \alpha_k v_k$

8. $r_{k+1} = A^T q_{k+1}$

9. $\beta_{k+1} = \|r_{k+1}\|^2 / \|r_k\|^2$

10. $p_{k+1} = r_{k+1} + \beta_{k+1} p_k$


Algorithm A.3 Conjugate gradient and conjugate gradient least squares algorithms. The
algorithms are described in more detail in the text, but in brief, they choose descent directions
$p_k$ that are conjugate to each other with respect to C by computing a factor $\beta$ by which to
discount the previous search direction $p_{k-1}$. They then find the optimal step size $\alpha$ and take
a downhill step by an amount $\alpha_k p_k$.
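
Both variants of Algorithm A.3 translate almost line for line into NumPy. The sketch below follows the step numbering of the algorithm; the function names, the iteration cap of twice the number of unknowns, the stopping threshold, and the random test problem are assumptions added for the example.

```python
import numpy as np

def conjugate_gradient(C, d, x0, n_iters=None, tol=1e-24):
    """Solve C x = d for a symmetric positive definite C."""
    x = np.asarray(x0, dtype=float).copy()
    r = d - C @ x                         # step 1
    p = r.copy()                          # step 2
    for _ in range(n_iters if n_iters is not None else 2 * len(d)):
        rr = r @ r
        if rr < tol:                      # squared residual small enough: done
            break
        w = C @ p                         # step 4
        alpha = rr / (p @ w)              # step 5
        x = x + alpha * p                 # step 6
        r = r - alpha * w                 # step 7
        beta = (r @ r) / rr               # step 9
        p = r + beta * p                  # step 10
    return x

def conjugate_gradient_ls(A, b, x0, n_iters=None, tol=1e-24):
    """Minimize ||A x - b||^2 without ever forming C = A^T A."""
    x = np.asarray(x0, dtype=float).copy()
    q = b - A @ x                         # step 1
    r = A.T @ q
    p = r.copy()                          # step 2
    for _ in range(n_iters if n_iters is not None else 2 * A.shape[1]):
        rr = r @ r
        if rr < tol:
            break
        v = A @ p                         # step 4
        alpha = rr / (v @ v)              # step 5
        x = x + alpha * p                 # step 6
        q = q - alpha * v                 # step 7
        r = A.T @ q                       # step 8
        beta = (r @ r) / rr               # step 9
        p = r + beta * p                  # step 10
    return x

# Hypothetical over-determined test problem.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))
b = rng.standard_normal(8)
x_cg = conjugate_gradient(A.T @ A, A.T @ b, np.zeros(3))
x_cgls = conjugate_gradient_ls(A, b, np.zeros(3))
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
```

The least squares variant only ever multiplies by A and $A^T$, which is why it can be cheaper and better behaved numerically than working with C directly.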


A.5.2 Preconditioning 

As we mentioned previously, the rate of convergence of the conjugate gradient algorithm
is governed in large part by the condition number $\kappa(C)$. Its effectiveness can therefore be
increased dramatically by reducing this number, e.g., by rescaling elements in x, which
corresponds to rescaling rows and columns in C.

In general, preconditioning is usually thought of as a change of basis from the vector x to
a new vector

$$\hat{x} = S x. \tag{A.52}$$

The corresponding linear system being solved then becomes

$$A S^{-1} \hat{x} = b \quad \text{or} \quad \hat{A} \hat{x} = b, \quad \text{with} \quad \hat{A} = A S^{-1}, \tag{A.53}$$

with a corresponding least squares energy (A.29) of the form

$$E_{\mathrm{PLS}} = \hat{x}^T (S^{-T} C S^{-1}) \hat{x} - 2 \hat{x}^T (S^{-T} d) + \|b\|^2. \tag{A.54}$$

The actual preconditioned matrix $\hat{C} = S^{-T} C S^{-1}$ is usually not explicitly computed.
Instead, Algorithm A.3 is extended to insert $S^{-T}$ and $S^{-1}$ operations at the appropriate places
(Bjorck 1996; Golub and Van Loan 1996; Trefethen and Bau 1997; Saad 2003; Nocedal and
Wright 2006).

A good preconditioner S is easy and cheap to compute, but is also a decent approximation
to a square root of C, so that $\kappa(S^{-T} C S^{-1})$ is closer to 1. The simplest such choice is the
square root of the diagonal matrix, $S = D^{1/2}$, with $D = \mathrm{diag}(C)$. This has the advantage
that any scalar change in variables (e.g., using radians instead of degrees for angular measurements)
has no effect on the rate of convergence of the iterative technique. For problems that
are naturally block-structured, e.g., for structure from motion, where 3D point positions or
6D camera poses are being estimated, a block diagonal preconditioner is often a good choice.
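
The benefit of the diagonal preconditioner $S = D^{1/2}$ can be checked numerically on a small matrix whose variables have very different scales. The example matrix and the helper name `jacobi_preconditioned` below are assumptions chosen for illustration; in practice, as noted above, the preconditioned matrix is not formed explicitly.

```python
import numpy as np

def jacobi_preconditioned(C):
    """Return S^{-T} C S^{-1} for the diagonal preconditioner S = diag(C)^{1/2}."""
    s = np.sqrt(np.diag(C))               # diagonal entries of S
    return C / np.outer(s, s)             # divides entry (i, j) by s_i * s_j

# A well-conditioned matrix whose second variable is expressed in units that
# are 1000 times smaller, which rescales the matching row and column.
C0 = np.array([[2.0, 1.0], [1.0, 2.0]])
scale = np.diag([1.0, 1000.0])
C = scale @ C0 @ scale

kappa_before = np.linalg.cond(C)
kappa_after = np.linalg.cond(jacobi_preconditioned(C))
```

Here the condition number drops from more than a million back to 3, the value for the original well-scaled matrix, which illustrates why this preconditioner makes the iteration insensitive to the choice of units.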

A wide variety of more sophisticated preconditioners have been developed over the years
(Bjorck 1996; Golub and Van Loan 1996; Trefethen and Bau 1997; Saad 2003; Nocedal and
Wright 2006), many of which can be directly applied to problems in computer vision (Byröd
and Åström 2009; Jeong, Nister, Steedly et al. 2010; Agarwal, Snavely, Seitz et al. 2010).
Some of these are based on an incomplete Cholesky factorization of C, i.e., one in which the
amount of fill-in in R is strictly limited, e.g., to just the original non-zero elements in C. 10
Other preconditioners are based on a sparsified, e.g., tree-based or clustered, approximation
to C (Koutis 2007; Koutis and Miller 2008; Grady 2008; Koutis, Miller, and Tolliver 2009),
since these are known to have efficient inversion properties.

For grid-based image-processing applications, parallel or hierarchical preconditioners 
often perform extremely well (Yserentant 1986; Szeliski 1990b; Pentland 1994; Saad 2003; 
Szeliski 2006b). These approaches use a change of basis transformation S that resembles 
the pyramidal or wavelet representations discussed in Section 3.5, and are hence amenable 
to parallel and GPU-based implementations. Coarser elements in the new representation 
quickly converge to the low-frequency components in the solution, while finer-level elements 
encode the higher-frequency components. Some of the relationships between hierarchical 
preconditioners, incomplete Cholesky factorization, and multigrid techniques are explored 
by Saad (2003) and Szeliski (2006b). 


10 If a complete Cholesky factorization $C = R^T R$ is used, we get $\hat{C} = R^{-T} C R^{-1} = I$ and all iterative
algorithms converge in a single step, thereby obviating the need to use them, but the complete factorization is often
too expensive. Note that incomplete factorization can also benefit from reordering.
A.5.3 Multigrid

One other class of iterative techniques widely used in computer vision is multigrid techniques 
(Briggs, Henson, and McCormick 2000; Trottenberg, Oosterlee, and Schuller 2000), which 
have been applied to problems such as surface interpolation (Terzopoulos 1986a), optical 
flow (Terzopoulos 1986a; Bruhn, Weickert, Kohlberger et al. 2006), high dynamic range tone 
mapping (Fattal, Lischinski, and Werman 2002), colorization (Levin, Lischinski, and Weiss 
2004), natural image matting (Levin, Lischinski, and Weiss 2008), and segmentation (Grady 
2008). 

The main idea behind multigrid is to form coarser (lower-resolution) versions of the prob- 
lems and use them to compute the low-frequency components of the solution. However, 
unlike simple coarse-to-fine techniques, which use the coarse solutions to initialize the fine 
solution, multigrid techniques only correct the low-frequency component of the current solu- 
tion and use multiple rounds of coarsening and refinement (in what are often called “V” and 
“W” patterns of motion across the pyramid) to obtain rapid convergence. 

On certain simple homogeneous problems (such as solving Poisson equations), multigrid 
techniques can achieve optimal performance, i.e., computation times linear in the number 
of variables. However, for more inhomogeneous problems or problems on irregular grids, 
variants on these techniques, such as algebraic multigrid (AMG) approaches, which look at 
the structure of C to derive coarse level problems, may be preferable. Saad (2003) has a 
nice discussion of the relationship between multigrid and parallel preconditioners and on the 
relative merits of using multigrid or conjugate gradient approaches. 
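
To make the idea of correcting only the low-frequency component concrete, here is a small recursive V-cycle for the 1D Poisson equation with zero boundary values. It is a sketch under several assumptions that do not come from the text: dense matrices are used for brevity, smoothing is done with weighted Jacobi, coarse operators are built by the Galerkin product of restriction, operator, and interpolation, and the grid size must be of the form $2^k - 1$.

```python
import numpy as np

def poisson_matrix(n):
    """Second-difference matrix for -u'' = f on n interior points of (0, 1)."""
    h = 1.0 / (n + 1)
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

def weighted_jacobi(A, x, b, n_sweeps, omega=2.0 / 3.0):
    """Smoother: damps the high-frequency error components quickly."""
    D = np.diag(A)
    for _ in range(n_sweeps):
        x = x + omega * (b - A @ x) / D
    return x

def v_cycle(A, x, b, n_smooth=3):
    """One V-cycle: smooth, correct on the coarser grid, smooth again."""
    n = len(b)
    if n <= 3:
        return np.linalg.solve(A, b)      # coarsest level: solve directly
    x = weighted_jacobi(A, x, b, n_smooth)
    n_c = (n - 1) // 2
    P = np.zeros((n, n_c))                # linear interpolation, coarse to fine
    for j in range(n_c):
        P[2 * j, j] = 0.5
        P[2 * j + 1, j] = 1.0
        P[2 * j + 2, j] = 0.5
    R = 0.5 * P.T                         # full-weighting restriction
    r_c = R @ (b - A @ x)                 # residual carried to the coarse grid
    e_c = v_cycle(R @ A @ P, np.zeros(n_c), r_c, n_smooth)
    x = x + P @ e_c                       # low-frequency correction
    return weighted_jacobi(A, x, b, n_smooth)

n = 63
A = poisson_matrix(n)
b = np.ones(n)
x_mg = np.zeros(n)
for _ in range(10):
    x_mg = v_cycle(A, x_mg, b)
x_jacobi = weighted_jacobi(A, np.zeros(n), b, 60)   # same number of fine-level sweeps
x_exact = np.linalg.solve(A, b)
```

Ten V-cycles use the same 60 fine-level smoothing sweeps as the plain Jacobi run, but only the multigrid iterate ends up close to the exact solution, because the smooth error components that Jacobi barely touches are removed on the coarser grids.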
The following problem commonly recurs in this book: Given a number of measurements
(images, feature positions, etc.), estimate the values of some unknown structure or parameter 
(camera positions, object shape, etc.). These kinds of problems are in general called inverse 
problems because they involve estimating unknown model parameters instead of simulating 
the forward formation equations. 1 Computer graphics is a classic forward modeling problem 
(given some objects, cameras, and lighting, simulate the images that would result), while 
computer vision problems are usually of the inverse kind (given one or more images, recover 
the scene that gave rise to these images). 

Given an instance of an inverse problem, there are, in general, several ways to proceed.
For instance, through clever (or sometimes straightforward) algebraic manipulation, a closed
form solution for the unknowns can sometimes be derived. Consider, for example, the camera
matrix calibration problem (Section 6.2.1): given an image of a calibration pattern consisting
of known 3D point positions, compute the 3 × 4 camera matrix P that maps these points onto
the image plane.

In more detail, we can write this problem as (6.33-6.34)

$$x_i = \frac{p_{00} X_i + p_{01} Y_i + p_{02} Z_i + p_{03}}{p_{20} X_i + p_{21} Y_i + p_{22} Z_i + p_{23}} \tag{B.1}$$

$$y_i = \frac{p_{10} X_i + p_{11} Y_i + p_{12} Z_i + p_{13}}{p_{20} X_i + p_{21} Y_i + p_{22} Z_i + p_{23}}, \tag{B.2}$$

where $(x_i, y_i)$ is the feature position of the $i$th point measured in the image plane, $(X_i, Y_i, Z_i)$
is the corresponding 3D point position, and the $p_{ij}$ are the unknown entries of the camera
matrix P. Moving the denominator over to the left hand side, we end up with a set of
simultaneous linear equations,

$$x_i (p_{20} X_i + p_{21} Y_i + p_{22} Z_i + p_{23}) = p_{00} X_i + p_{01} Y_i + p_{02} Z_i + p_{03}, \tag{B.3}$$

$$y_i (p_{20} X_i + p_{21} Y_i + p_{22} Z_i + p_{23}) = p_{10} X_i + p_{11} Y_i + p_{12} Z_i + p_{13}, \tag{B.4}$$

which we can solve using linear least squares (Appendix A.2) to obtain an estimate of P.
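
Equations (B.3-B.4) can be stacked into a single homogeneous linear system in the 12 entries of P. The sketch below is one possible way to solve it, under the assumption (not stated in the text) that the unknown overall scale of P is fixed by requiring the stacked parameter vector to have unit norm, so that the least squares solution is the right singular vector with the smallest singular value. The function names and the synthetic camera are hypothetical.

```python
import numpy as np

def calibrate_camera(X, x):
    """Estimate the 3x4 camera matrix P from 3D points X (n x 3) and image points x (n x 2)."""
    rows = []
    for (Xi, Yi, Zi), (xi, yi) in zip(X, x):
        Xh = [Xi, Yi, Zi, 1.0]
        # (B.3): row 0 of P applied to Xh, minus x_i times row 2 applied to Xh, is zero.
        rows.append(Xh + [0.0] * 4 + [-xi * c for c in Xh])
        # (B.4): row 1 of P applied to Xh, minus y_i times row 2 applied to Xh, is zero.
        rows.append([0.0] * 4 + Xh + [-yi * c for c in Xh])
    M = np.array(rows)
    _, _, Vt = np.linalg.svd(M)           # last row of Vt: smallest singular value
    return Vt[-1].reshape(3, 4)

def project(P, X):
    """Apply (B.1-B.2) to every row of X."""
    Xh = np.hstack([X, np.ones((len(X), 1))])
    xh = Xh @ P.T
    return xh[:, :2] / xh[:, 2:3]

# Synthetic, noise-free test: recover a known camera up to scale.
rng = np.random.default_rng(0)
P_true = np.array([[800.0, 0.0, 320.0, 10.0],
                   [0.0, 800.0, 240.0, 20.0],
                   [0.0, 0.0, 1.0, 5.0]])
X = rng.uniform(-1.0, 1.0, size=(10, 3))
x = project(P_true, X)
P_est = calibrate_camera(X, x)
P_est *= P_true[2, 3] / P_est[2, 3]       # remove the arbitrary scale for comparison
```

With noise-free measurements the camera is recovered exactly (up to scale); with noisy measurements this algebraic solution is generally not the statistically best one.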

The question then arises: are these the right equations to be solving? If the
measurements are totally noise-free or we do not care about getting the best possible answer,
then the answer is yes. However, in general, we cannot be sure that we have a reasonable
algorithm unless we make a model of the likely sources of error and devise an algorithm that
performs as well as possible given these potential errors.


1 In machine learning, these problems are called regression problems, because we are trying to estimate a
continuous quantity from noisy inputs, as opposed to a discrete classification task (Bishop 2006).
B.1 Estimation theory

The study of such inference problems from noisy data is often called estimation theory (Gelb
1974), and its extension to problems where we explicitly choose a loss function is called
statistical decision theory (Berger 1993; Hastie, Tibshirani, and Friedman 2001; Bishop 2006;
Robert 2007). We first start by writing down the forward process that leads from our
unknowns (and knowns) to a set of noise-corrupted measurements. We then devise an algorithm
that gives us an estimate (or set of estimates) that is as insensitive to the noise as possible
and that also quantifies the reliability of these estimates.

The specific equations above (B.1) are just a particular instance of a more general set of
measurement equations,

$$y_i = f_i(x) + n_i. \tag{B.5}$$

Here, the $y_i$ are the noise-corrupted measurements, e.g., $(x_i, y_i)$ in Equation (B.1), and x is
the unknown state vector. 2

Each measurement comes with its associated measurement model $f_i(x)$, which maps the
unknown into that particular measurement. An alternative formulation would be to have one
general function $f(x, p_i)$ and to use a per-measurement parameter vector $p_i$ to distinguish
between different measurements, e.g., $(X_i, Y_i, Z_i)$ in Equation (B.1). Note that the use of the
$f_i(x)$ form makes it straightforward to have measurements of different dimensions, which
becomes useful when we start adding in prior information (Appendix B.4).

Each measurement is also contaminated with some noise $n_i$. In Equation (B.5), we have
indicated that $n_i$ is a zero-mean normal (Gaussian) random variable with a covariance matrix
$\Sigma_i$. In general, the noise need not be Gaussian and, in fact, it is usually prudent to assume
that some measurements may be outliers. However, we defer this discussion to Appendix B.3,
after we have explored the simpler Gaussian noise case more fully. We also assume that the
noise vectors $n_i$ are independent. In the case where they are not (e.g., when some constant
gain or offset contaminates all of the pixels in a given image), we can add this effect as a
nuisance parameter to our state vector x and later estimate its value (and discard it, if so
desired).

B.1.1 Likelihood for multivariate Gaussian noise 

Given all of the noisy measurements $y = \{y_i\}$, we would like to infer a probability distribution
on the unknown x vector. We can write the likelihood of having observed the $\{y_i\}$ given
a particular value of x as

$$L = p(y|x) = \prod_i p(y_i|x) = \prod_i p(y_i|f_i(x)) = \prod_i p(n_i). \tag{B.6}$$

2 In the Kalman filtering literature (Gelb 1974), it is more common to use z instead of y to denote measurements. 
When each noise vector $n_i$ is a multivariate Gaussian with covariance $\Sigma_i$,

$$n_i \sim \mathcal{N}(0, \Sigma_i), \tag{B.7}$$

we can write this likelihood as

$$L = \prod_i |2\pi\Sigma_i|^{-1/2} \exp\left(-\frac{1}{2} (y_i - f_i(x))^T \Sigma_i^{-1} (y_i - f_i(x))\right) \tag{B.8}$$

$$= \prod_i |2\pi\Sigma_i|^{-1/2} \exp\left(-\frac{1}{2} \|y_i - f_i(x)\|^2_{\Sigma_i^{-1}}\right),$$

where the matrix norm $\|x\|^2_A$ is a shorthand notation for $x^T A x$.

The norm $\|y_i - \bar{y}_i\|_{\Sigma_i^{-1}}$ is often called the Mahalanobis distance (5.26 and 14.14) and is
used to measure the distance between a measurement and the mean of a multivariate Gaussian
distribution. Contours of equal Mahalanobis distance are equi-probability contours. Note
that when the measurement covariance is isotropic (the same in all directions), i.e., when
$\Sigma_i = \sigma_i^2 I$, the likelihood can be written as

$$L = \prod_i (2\pi\sigma_i^2)^{-N_i/2} \exp\left(-\frac{1}{2\sigma_i^2} \|y_i - f_i(x)\|^2\right), \tag{B.9}$$

where $N_i$ is the length of the $i$th measurement vector $y_i$.

We can more easily visualize the structure of the covariance matrix and the corresponding
Mahalanobis distance if we first perform an eigenvalue or principal component analysis
(PCA) of the covariance matrix (A.6),

$$\Sigma = \Phi\, \mathrm{diag}(\lambda_0 \ldots \lambda_{N-1})\, \Phi^T. \tag{B.10}$$

Equal-probability contours of the corresponding multi-variate Gaussian, which are also
equi-distance contours in the Mahalanobis distance (Figure 14.14), are multi-dimensional
ellipsoids whose axis directions are given by the columns of $\Phi$ (the eigenvectors) and whose
lengths are given by the $\sigma_j = \sqrt{\lambda_j}$ (Figure A.1).
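
The relationship between the eigen-decomposition (B.10) and the Mahalanobis distance can be verified numerically. In this sketch, the 2 × 2 covariance matrix and the helper name `mahalanobis_squared` are assumptions chosen for the example.

```python
import numpy as np

def mahalanobis_squared(y, mean, cov):
    """Squared Mahalanobis distance (y - mean)^T cov^{-1} (y - mean)."""
    diff = np.asarray(y, dtype=float) - mean
    return float(diff @ np.linalg.solve(cov, diff))

cov = np.array([[3.0, 1.0], [1.0, 3.0]])
mean = np.zeros(2)

# Eigen-decomposition (B.10): eigenvalues in ascending order, eigenvectors in columns.
lam, Phi = np.linalg.eigh(cov)
sigma = np.sqrt(lam)                      # semi-axis lengths of the 1-sigma ellipse

# A point one sigma_j away from the mean along eigenvector j lies on the
# contour of unit Mahalanobis distance.
d2 = [mahalanobis_squared(mean + sigma[j] * Phi[:, j], mean, cov) for j in range(2)]
```

For this matrix the eigenvalues are 2 and 4, so the unit-distance ellipse has semi-axes of length $\sqrt{2}$ and 2 along the two diagonal directions.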

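It is usually more convenient to work with the negative log likelihood, which we can think
of as a cost or energy

$$E = -\log L = \frac{1}{2} \sum_i (y_i - f_i(x))^T \Sigma_i^{-1} (y_i - f_i(x)) + k \tag{B.11}$$

$$= \frac{1}{2} \sum_i \|y_i - f_i(x)\|^2_{\Sigma_i^{-1}} + k, \tag{B.12}$$

where $k = \frac{1}{2} \sum_i \log |2\pi\Sigma_i|$ is a constant that depends on the measurement variances, but is
independent of x.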
Notice that the inverse covariance $C_i = \Sigma_i^{-1}$ plays the role of a weight on each of the
measurement error residuals, i.e., the difference between the contaminated measurement $y_i$
and its uncontaminated (predicted) value $f_i(x)$. In fact, the inverse covariance is often called
the (Fisher) information matrix (Bishop 2006), since it tells us how much information is
contained in a given measurement, i.e., how well it constrains the final estimate. We can also
think of this matrix as denoting the amount of confidence to associate with each measurement
(hence the letter C).

In this formulation, it is quite acceptable for some information matrices to be singular
(of degenerate rank) or even zero (if the measurement is missing altogether). Rank-deficient
measurements often occur, for example, when using a line feature or edge to measure a 3D
edge-like feature, since its exact position along the edge is unknown (of infinite or extremely
large variance), as discussed in Section 8.1.3.

In order to make the distinction between the noise contaminated measurement and its
expected value for a particular setting of x more explicit, we adopt the notation $\tilde{y}$ for the
former (think of the tilde as the approximate or noisy value) and $\hat{y} = f_i(x)$ for the latter
(think of the hat as the predicted or expected value). We can then write the negative log
likelihood as

$$E = -\log L = \frac{1}{2} \sum_i \|\tilde{y}_i - \hat{y}_i\|^2_{\Sigma_i^{-1}} + k. \tag{B.13}$$

B.2 Maximum likelihood estimation and least squares 

Now that we have presented the likelihood and log likelihood functions, how can we find the
optimal value for our state estimate x? One plausible choice might be to select the value of x
that maximizes $L = p(y|x)$. In fact, in the absence of any prior model for x (Appendix B.4),
we have

$$L = p(y|x) \propto p(y, x) \propto p(x|y),$$

where the constants of proportionality do not depend on x.
Therefore, choosing the value of x that maximizes the likelihood is equivalent to choosing
the maximum of our probability density estimate for x.

When might this be a good idea? If the data (measurements) constrain the possible values
of x so that they all cluster tightly around one value (e.g., if the distribution $p(x|y)$ is a
unimodal Gaussian), the maximum likelihood estimate is the optimal one in that it is both
unbiased and has the least possible variance. In many other cases, e.g., if a single estimate
is all that is required, it is still often the best estimate. 3 However, if the probability is
multi-modal, i.e., it has several local minima in the negative log likelihood (Figure 5.7), much
more care may be required. In particular, it might be necessary to defer certain decisions (such as the
ultimate position of an object being tracked) until more measurements have been taken. The
CONDENSATION algorithm presented in Section 5.1.2 is one possible method for modeling
and updating such multi-modal distributions but is just one example of more general particle
filtering and Markov Chain Monte Carlo (MCMC) techniques (Andrieu, de Freitas, Doucet
et al. 2003; Bishop 2006; Koller and Friedman 2009).

3 According to the Gauss-Markov theorem, least squares produces the best linear unbiased estimator (BLUE) for
a linear measurement model regardless of the actual noise distribution, assuming that the noise is zero mean and
uncorrelated.

Another possible way to choose the best estimate is to maximize the expected utility
(or, conversely, to minimize the expected risk or loss) associated with obtaining the correct
estimate, i.e., by minimizing

$$E_{\mathrm{loss}}(x, y) = \int p(x'|y)\, l(x - x')\, dx'. \tag{B.14}$$

For example, if a robot wants to avoid hitting a wall at all costs, the loss function will be
high whenever the estimate underestimates the true distance to the wall. When $l(x - y) =
\delta(x - y)$, we obtain the maximum likelihood estimate, whereas when $l(x - y) = \|x - y\|^2$,
we obtain the mean square error (MSE) or expected value estimate. The explicit modeling of
a utility or loss function is what characterizes statistical decision theory (Berger 1993; Hastie,
Tibshirani, and Friedman 2001; Bishop 2006; Robert 2007).

How do we find the maximum likelihood estimate? If the measurement noise is Gaussian,
we can minimize the quadratic objective function (B.13). This becomes even simpler if the
measurement equations are linear, i.e.,

$$f_i(x) = H_i x, \tag{B.15}$$

where $H_i$ is the measurement matrix relating the unknown state variables x to the measurement $y_i$.
In this case, (B.13) becomes, after dropping the factor of 1/2 and the constant k (neither of
which affects the minimizer),

$$E = \sum_i \|\tilde{y}_i - H_i x\|^2_{\Sigma_i^{-1}} = \sum_i (\tilde{y}_i - H_i x)^T C_i (\tilde{y}_i - H_i x), \tag{B.16}$$

which is a simple quadratic form in x, which can be solved using linear least squares
(Appendix A.2). When the measurements are non-linear, the system must be solved iteratively
using non-linear least squares (Appendix A.3).

B.3 Robust statistics

In Appendix B.1.1, we assumed that the noise being added to each measurement (B.5) was
multivariate Gaussian (B.7). This is an appropriate model if the noise is the result of lots of
tiny errors being added together, e.g., from thermal noise in a silicon imager. In most cases,
however, measurements can be contaminated with larger outliers, i.e., gross failures in the
measurement process. Examples of such outliers include bad feature matches (Section 6.1.4),
occlusions in stereo matching (Chapter 11), and discontinuities in an otherwise smooth image,
depth map, or label image (Sections 3.7.1 and 3.7.2).

In such cases, it makes more sense to model the measurement noise with a long-tailed 
contaminated noise model such as a Laplacian. The negative log likelihood in this case, 
rather than being quadratic in the measurement residuals (B.12-B.16), has a slower growth 
in the penalty function to account for the increased likelihood of large errors. 

This formulation of the inference problem is called an M-estimator in the robust statistics
literature (Huber 1981; Hampel, Ronchetti, Rousseeuw et al. 1986; Black and Rangarajan
1996; Stewart 1999) and involves applying a robust penalty function $\rho(r)$ to the residuals

$$E_{\mathrm{RLS}}(\Delta p) = \sum_i \rho(\|r_i\|) \tag{B.17}$$

instead of squaring them.

As we mentioned in Section 6.1.4, we can take the derivative of this function with respect
to p and set it to 0,

$$\sum_i \psi(\|r_i\|) \frac{\partial \|r_i\|}{\partial p} = \sum_i \frac{\psi(\|r_i\|)}{\|r_i\|}\, r_i^T \frac{\partial r_i}{\partial p} = 0, \tag{B.18}$$

where $\psi(r) = \rho'(r)$ is the derivative of $\rho$ and is called the influence function. If we introduce a
weight function, $w(r) = \psi(r)/r$, we observe that finding the stationary point of (B.17) using
(B.18) is equivalent to minimizing the iteratively re-weighted least squares (IRLS) problem

$$E_{\mathrm{IRLS}} = \sum_i w(\|r_i\|)\, \|r_i\|^2, \tag{B.19}$$

where the $w(\|r_i\|)$ play the same local weighting role as $C_i = \Sigma_i^{-1}$ in (B.12). Black and
Anandan (1996) describe a variety of robust penalty functions and their corresponding
influence and weighting functions.

The IRLS algorithm alternates between computing the weight functions $w(\|r_i\|)$ and
solving the resulting weighted least squares problem (with fixed w values). Alternative
incremental robust least squares algorithms can be found in the work of Sawhney and Ayer
(1996); Black and Anandan (1996); Black and Rangarajan (1996); Baker, Gross, Ishikawa et
al. (2003) and textbooks and tutorials on robust statistics (Huber 1981; Hampel, Ronchetti,
Rousseeuw et al. 1986; Rousseeuw and Leroy 1987; Stewart 1999). It is also possible to apply
general optimization techniques (Appendix A.3) directly to the non-linear cost function
given in Equation (B.19), which may sometimes have better convergence properties.
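
The alternation between computing weights and solving a weighted least squares problem can be sketched for a robust line fit. The choice of the Huber penalty, whose weight function is 1 for small residuals and `scale / |r|` for large ones, is an assumption made for this example, as are the function names, the fixed number of iterations, and the synthetic data with one gross outlier.

```python
import numpy as np

def huber_weight(r, scale):
    """IRLS weight w(r) = psi(r) / r for the Huber penalty with threshold `scale`."""
    a = np.maximum(np.abs(r), 1e-12)      # avoid division by zero
    return np.where(a <= scale, 1.0, scale / a)

def irls(A, b, scale, n_iters=20):
    """Robustly solve A x ~ b by iteratively re-weighted least squares."""
    x, *_ = np.linalg.lstsq(A, b, rcond=None)      # ordinary least squares start
    for _ in range(n_iters):
        # Multiplying each row by sqrt(w_i) minimizes sum_i w_i r_i^2, as in (B.19).
        sw = np.sqrt(huber_weight(b - A @ x, scale))
        x, *_ = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)
    return x

# Line y = 2 t + 1 sampled without noise, plus one gross outlier.
t = np.arange(10.0)
A = np.column_stack([t, np.ones_like(t)])          # unknowns: slope, intercept
b = 2.0 * t + 1.0
b[9] += 50.0

x_ols, *_ = np.linalg.lstsq(A, b, rcond=None)
x_irls = irls(A, b, scale=1.0)
```

The ordinary least squares fit is pulled far away from the true line by the single outlier, whereas the re-weighted fit stays close to it, since the outlier's influence is bounded by the Huber threshold.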

Most robust penalty functions involve a scale parameter, which should typically be set to 
the variance (or standard deviation, depending on the formulation) of the non-contaminated 
(inlier) noise. Estimating such noise levels directly from the measurements or their residuals,
however, can be problematic, as such estimates themselves become contaminated by outliers.
The robust statistics literature contains a variety of techniques to estimate such parameters.
One of the simplest and most effective is the median absolute deviation (MAD),

$$\text{MAD} = \operatorname{med}_i \|r_i\|, \tag{B.20}$$

which, when multiplied by 1.4, provides a robust estimate of the standard deviation of the
inlier noise process. 
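
A minimal sketch of this scale estimate, assuming scalar residuals that are already centered around zero as in (B.20). The function name `mad_sigma` and the sample residuals are hypothetical; the factor 1.4826 is the usual Gaussian consistency constant that the text rounds to 1.4.

```python
import statistics


def mad_sigma(residuals):
    """Robust estimate of the inlier standard deviation from residuals (B.20)."""
    mad = statistics.median(abs(r) for r in residuals)
    return 1.4826 * mad  # Gaussian consistency factor, about 1.4


residuals = [0.1, -0.2, 0.05, -0.1, 0.15, 50.0, -0.05]  # one gross outlier
sigma_robust = mad_sigma(residuals)
sigma_naive = statistics.pstdev(residuals)  # inflated by the outlier
print(sigma_robust, sigma_naive)
```

The single outlier dominates the ordinary standard deviation but leaves the median-based estimate essentially unchanged.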

As mentioned in Section 6.1.4, it is often better to start iterative non-linear minimiza- 
tion techniques, such as IRLS, in the vicinity of a good solution by first randomly selecting 
small subsets of measurements until a good set of inliers is found. The best known of these 
techniques is RANdom SAmple Consensus (RANSAC) (Fischler and Bolles 1981), although 
even better variants such as Preemptive RANSAC (Nistér 2003) and PROgressive SAmple
Consensus (PROSAC) (Chum and Matas 2005) have since been developed. 
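
The random sampling idea can be illustrated with a short sketch for line fitting. It is a simplified illustration rather than the algorithm of any of the cited papers: the vertical-distance inlier test, the tolerance `inlier_tol`, the fixed number of trials, and the function name `ransac_line` are all assumptions made for this example.

```python
import random


def ransac_line(points, n_trials=200, inlier_tol=0.5, seed=0):
    """Fit y = a * x + b by repeatedly sampling minimal subsets (two points)."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(n_trials):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # degenerate sample: vertical line cannot be written as y = a x + b
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        # Count the measurements that agree with this hypothesis.
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) < inlier_tol]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers


# Ten points on y = 2 x + 1 plus three gross outliers.
points = [(float(x), 2.0 * x + 1.0) for x in range(10)]
points += [(1.5, 30.0), (4.5, -20.0), (7.5, 55.0)]
model, inliers = ransac_line(points)
print(model, len(inliers))
```

The returned inlier set would then typically be used to initialize a least squares or IRLS refinement.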

B.4 Prior models and Bayesian inference 

While maximum likelihood estimation can often lead to good solutions, in some cases the 
range of possible solutions consistent with the measurements is too large to be useful. For 
example, consider the problem of image denoising (Sections 3.4.4 and 3.7.3). If we esti- 
mate each pixel separately based on just its noisy version, we cannot make any progress, 
as there are a large number of values that could lead to each noisy measurement. 4 Instead, 
we need to rely on typical properties of images, e.g., that they tend to be piecewise smooth 
(Section 3.7.1). 

The propensity of images to be piecewise smooth can be encoded in a prior distribution 
p(x), which measures the likelihood of an image being a natural image. For example, to 
encode piecewise smoothness, we can use a Markov random field model (3.109 and B.24) 
whose negative log likelihood is proportional to a robustified measure of image smoothness 
(gradient magnitudes). 

Prior models need not be restricted to image processing applications. For example, we 
may have some external knowledge about the rough dimensions of an object being scanned, 
the focal length of a lens being calibrated, or the likelihood that a particular object might 
appear in an image. All of these are examples of prior distributions or probabilities and they 
can be used to produce more reliable estimates. 

As we have already seen in (3.68) and (3.106), Bayes’ Rule states that a posterior distribution
$p(x|y)$ over the unknowns x given the measurements y can be obtained by multiplying

4 In fact, the maximum likelihood estimate is just the noisy image itself. 
the measurement likelihood $p(y|x)$ by the prior distribution $p(x)$,

$$p(x|y) = \frac{p(y|x)\, p(x)}{p(y)}, \tag{B.21}$$

where $p(y) = \int_x p(y|x)\, p(x)$ is a normalizing constant used to make the $p(x|y)$ distribution
proper (integrate to 1). Taking the negative logarithm of both sides of Equation (B.21), we 
get 

$$-\log p(x|y) = -\log p(y|x) - \log p(x) + \log p(y), \tag{B.22}$$

which is the negative posterior log likelihood. It is common to drop the constant log p(y) be- 
cause its value does not matter during energy minimization. However, if the prior distribution 
p(x) depends on some unknown parameters, we may wish to keep log p(y) in order to com- 
pute the most likely value of these parameters using Occam’s razor, i.e., by maximizing the 
likelihood of the observations, or to select the correct number of free parameters using model 
selection (Hastie, Tibshirani, and Friedman 2001; Torr 2002; Bishop 2006; Robert 2007). 

To find the most likely (maximum a posteriori or MAP) solution x given some measurements
y, we simply minimize this negative log likelihood, which can also be thought of as an
energy,

$$E(x, y) = E_d(x, y) + E_p(x). \tag{B.23}$$

The first term $E_d(x, y)$ is the data energy or data penalty and measures the negative log
likelihood that the measurements y were observed given the unknown state x. The second
term $E_p(x)$ is the prior energy and it plays a role analogous to the smoothness energy in
regularization. Note that the MAP estimate may not always be desirable, since it selects the 
“peak” in the posterior distribution rather than some more stable statistic such as MSE — see 
the discussion in Appendix B.2 about loss functions and decision theory. 
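
As a small worked illustration of (B.23), consider a single scalar unknown x observed as y with Gaussian noise of standard deviation sigma, together with a Gaussian prior with mean mu0 and standard deviation sigma0. Both energies are then quadratic and the MAP estimate is the precision-weighted average of the measurement and the prior mean. The scalar Gaussian model and the names `energy` and `map_estimate` are assumptions chosen for this sketch.

```python
def energy(x, y, sigma, mu0, sigma0):
    """E(x, y) = E_d(x, y) + E_p(x) for Gaussian noise and a Gaussian prior."""
    e_data = (y - x) ** 2 / (2 * sigma ** 2)
    e_prior = (x - mu0) ** 2 / (2 * sigma0 ** 2)
    return e_data + e_prior


def map_estimate(y, sigma, mu0, sigma0):
    """Closed-form minimizer of the quadratic energy above."""
    w_d, w_p = 1.0 / sigma ** 2, 1.0 / sigma0 ** 2  # precisions (inverse variances)
    return (w_d * y + w_p * mu0) / (w_d + w_p)


# A noisy measurement y = 10 combined with a tighter prior centered at 0.
x_map = map_estimate(y=10.0, sigma=2.0, mu0=0.0, sigma0=1.0)
print(x_map)  # the maximum likelihood estimate would simply be y = 10
```

The prior pulls the estimate away from the raw measurement in proportion to the relative confidence of the two terms.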


B.5 Markov random fields 

Markov random fields (Blake, Kohli, and Rother 2010) are the most popular types of prior 
model for gridded image-like data, 5 which include not only regular natural images (Sec- 
tion 3.7.2) but also two-dimensional fields such as optic flow (Chapter 8) or depth maps 
(Chapter 11), as well as binary fields, such as segmentations (Section 5.5). 

5 Alternative formulations include power spectra (Section 3.4.3) and non-local means (Buades, Coll, and Morel
2008).

Figure B.1 Graphical model for an $\mathcal{N}_4$ neighborhood Markov random field. The white
circles are the unknowns f(i,j), while the dark circles are the input data d(i,j). The $s_x(i,j)$
and $s_y(i,j)$ black boxes denote arbitrary interaction potentials between adjacent nodes in the
random field, and the $w(i,j)$ denote the data penalty functions. They are all examples of the
general potentials $V_{i,j,k,l}(f(i,j), f(k,l))$ used in Equation (B.24).

As we discussed in Section 3.7.2, the prior probability p(x) for a Markov random field is
a Gibbs or Boltzmann distribution, whose negative log likelihood (according to the
Hammersley-Clifford Theorem) can be written as a sum of pairwise interaction potentials,

$$E_p(x) = \sum_{\{(i,j),(k,l)\} \in \mathcal{N}} V_{i,j,k,l}(f(i,j), f(k,l)), \tag{B.24}$$

where $\mathcal{N}(i,j)$ denotes the neighbors of pixel (i,j). In the more general case, MRFs can also
contain unary potentials, as well as higher-order potentials defined over larger cardinality
cliques (Kindermann and Snell 1980; Geman and Geman 1984; Bishop 2006; Potetz and Lee
2008; Kohli, Kumar, and Torr 2009; Kohli, Ladicky, and Torr 2009; Rother, Kohli, Feng et
al. 2009; Alahari, Kohli, and Torr 2011). They can also contain line processes, i.e., additional
binary variables that mediate discontinuities between adjacent elements (Geman and Geman 
1984). Black and Rangarajan (1996) show how independent line process variables can be 
eliminated and incorporated into regular MRFs using robust pairwise penalty functions. 

The most commonly used neighborhood in Markov random field modeling is the $\mathcal{N}_4$
neighborhood, where each pixel in the field interacts only with its immediate neighbors;
Figure B.1 shows such an $\mathcal{N}_4$ MRF. The $s_x(i,j)$ and $s_y(i,j)$ black boxes denote arbitrary
interaction potentials between adjacent nodes in the random field and the $w(i,j)$ denote the
elemental data penalty terms in $E_d$ (B.23). These square nodes can also be interpreted as
factors (another name for interaction potentials) in a factor graph version of the undirected
graphical model (Bishop 2006; Wainwright and Jordan 2008; Koller and Friedman 2009).
(Strictly speaking, the factors are improper probability functions whose product is the
un-normalized posterior distribution.)
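
To make the structure of such an energy concrete, the following sketch evaluates the total energy of a labeling on an N4 grid. The specific potentials are assumptions picked for illustration only: a squared-difference data penalty and a Potts (delta function) interaction with weight `lam`. The function name `mrf_energy` and the tiny 3x3 example are likewise hypothetical.

```python
def mrf_energy(labels, data, lam=1.0):
    """Sum of data penalties and pairwise Potts potentials over an N4 grid."""
    h, w = len(labels), len(labels[0])
    e = 0.0
    for i in range(h):
        for j in range(w):
            e += (labels[i][j] - data[i][j]) ** 2  # data penalty w(i, j)
            if j + 1 < w:  # horizontal interaction s_x(i, j)
                e += lam * (labels[i][j] != labels[i][j + 1])
            if i + 1 < h:  # vertical interaction s_y(i, j)
                e += lam * (labels[i][j] != labels[i + 1][j])
    return e


data = [[0, 0, 1],
        [0, 1, 1],
        [0, 0, 1]]
smooth = [[0, 0, 1],
          [0, 0, 1],
          [0, 0, 1]]

e_data_labels = mrf_energy(data, data)    # fits the data exactly, but less smooth
e_smooth_labels = mrf_energy(smooth, data)  # pays one data penalty, fewer label changes
print(e_data_labels, e_smooth_labels)
```

Each neighboring pair is visited once, so every pairwise potential contributes exactly one term to the sum.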

More complex and higher-dimensional interaction models and neighborhoods are also 
possible. For example, 2D grids can be enhanced with the addition of diagonal connections
(an $\mathcal{N}_8$ neighborhood) or even larger numbers of pairwise terms (Boykov and Kolmogorov
2003; Rother, Kolmogorov, Lempitsky et al. 2007). 3D grids can be used to compute glob- 
ally optimal segmentations in 3D volumetric medical images (Boykov and Funka-Lea 2006) 
(Section 5.5.1). Higher-order cliques can also be used to develop more sophisticated models 
(Potetz and Lee 2008; Kohli, Ladicky, and Torr 2009; Kohli, Kumar, and Torr 2009). 

One of the biggest challenges in using MRF models is to develop efficient inference algo- 
rithms that will find low-energy solutions (Veksler 1999; Boykov, Veksler, and Zabih 2001; 
Kohli 2007; Kumar 2008). Over the years, a large variety of such algorithms have been de- 
veloped, including simulated annealing, graph cuts, and loopy belief propagation. The choice 
of inference technique can greatly affect the overall performance of a vision system. For 
example, most of the top-performing algorithms on the Middlebury Stereo Evaluation page
use either belief propagation or graph cuts.

In the next few subsections, we review some of the more widely used MRF inference 
techniques. More in-depth descriptions of most of these algorithms can be found in a re- 
cently published book on advances in MRF techniques (Blake, Kohli, and Rother 2010). 
Experimental comparisons, along with test datasets and reference software, are provided by 
Szeliski, Zabih, Scharstein et al. (2008). 6 


B.5.1 Gradient descent and simulated annealing 

The simplest optimization technique is gradient descent, which minimizes the energy by 
changing independent subsets of nodes to take on lower-energy configurations. Such tech- 
niques go under a variety of names, including contextual classification (Kittler and Foglein 
1984) and iterated conditional modes (ICM) (Besag 1986). 7 Variables can either be updated 
sequentially, e.g., in raster scan, or in parallel, e.g., using red-black coloring on a checkerboard.
Chou and Brown (1990) suggest using highest confidence first (HCF), i.e., choosing
variables based on how large a difference they make in reducing the energy. 
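
A minimal sketch of iterated conditional modes with sequential raster-scan updates is given below. It assumes the same illustrative energy as a simple denoising problem (squared-difference data penalty, Potts interaction with weight `lam`, N4 neighborhood); these potentials, the initialization from the data, and the names `local_energy` and `icm` are assumptions for this example.

```python
def local_energy(labels, data, i, j, value, lam):
    """Energy terms that involve pixel (i, j) if it takes the label `value`."""
    h, w = len(labels), len(labels[0])
    e = (value - data[i][j]) ** 2
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < h and 0 <= nj < w:
            e += lam * (value != labels[ni][nj])
    return e


def icm(data, num_labels=2, lam=1.0, max_sweeps=10):
    """Set each variable to its lowest-energy state given its current neighbors."""
    labels = [row[:] for row in data]  # start from the noisy data
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(labels)):          # raster-scan order
            for j in range(len(labels[0])):
                best = min(range(num_labels),
                           key=lambda v: local_energy(labels, data, i, j, v, lam))
                if best != labels[i][j]:
                    labels[i][j] = best
                    changed = True
        if not changed:
            break  # reached a local minimum of the energy
    return labels


noisy = [[0, 0, 0, 0],
         [0, 1, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
clean = icm(noisy)
print(clean)
```

Each update can only lower (or keep) the total energy, which is why the procedure terminates but can stop at a poor local minimum.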

The problem with gradient descent is that it is prone to getting stuck in local minima, 
which is almost always the case with MRF problems. One way around this is to use stochastic 
gradient descent or Markov chain Monte Carlo (MCMC) (Metropolis, Rosenbluth, Rosen- 
bluth et al. 1953), i.e., to randomly take occasional uphill steps in order to get out of such 
minima. One popular update rule is the Gibbs sampler (Geman and Geman 1984); rather 
than choosing the lowest energy state for a variable being updated, it chooses the state with 


6 http://vision.middlebury.edu/MRF/. 

7 The name comes from iteratively setting variables to the mode (most likely, i.e., lowest energy) state conditioned 
on its currently fixed neighbors. 
probability

$$p(x) \propto e^{-E(x)/T}, \tag{B.25}$$

where T is called the temperature and controls how likely the system is to choose a more 
random update. Stochastic gradient descent is usually combined with simulated annealing 
(Kirkpatrick, Gelatt, and Vecchi 1983), which starts at a relatively high temperature, thereby 
randomly exploring a large part of the state space, and gradually cools (anneals) the tem- 
perature to find a good local minimum. During the late 1980s, simulated annealing was the 
method of choice for solving MRF inference problems (Szeliski 1986; Marroquin, Mitter, 
and Poggio 1985; Barnard 1989). 
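
The combination of a Gibbs sampler with simulated annealing can be sketched as follows for a binary field. The energy (squared-difference data penalty plus Potts interaction with weight `lam`), the geometric cooling schedule, the random initialization, and the names `site_energies` and `gibbs_anneal` are all assumptions made for this illustration.

```python
import math
import random

NEIGHBORS = ((-1, 0), (1, 0), (0, -1), (0, 1))  # N4 neighborhood


def site_energies(labels, data, i, j, lam):
    """Energy of each candidate label (0 or 1) at pixel (i, j), neighbors held fixed."""
    h, w = len(labels), len(labels[0])
    energies = []
    for v in (0, 1):
        e = (v - data[i][j]) ** 2  # data penalty
        for di, dj in NEIGHBORS:
            ni, nj = i + di, j + dj
            if 0 <= ni < h and 0 <= nj < w:
                e += lam * (v != labels[ni][nj])  # Potts smoothness penalty
        energies.append(e)
    return energies


def gibbs_anneal(data, lam=1.0, t_start=4.0, t_end=0.05, sweeps=200, seed=0):
    rng = random.Random(seed)
    h, w = len(data), len(data[0])
    labels = [[rng.randint(0, 1) for _ in range(w)] for _ in range(h)]
    for s in range(sweeps):
        # Geometric cooling schedule from t_start down to t_end.
        T = t_start * (t_end / t_start) ** (s / max(1, sweeps - 1))
        for i in range(h):
            for j in range(w):
                e0, e1 = site_energies(labels, data, i, j, lam)
                e_min = min(e0, e1)  # subtract the minimum for numerical stability
                p0 = math.exp(-(e0 - e_min) / T)
                p1 = math.exp(-(e1 - e_min) / T)
                # Sample the new state with probability proportional to exp(-E / T).
                labels[i][j] = 0 if rng.random() * (p0 + p1) < p0 else 1
    return labels


noisy = [[0, 0, 1, 1],
         [0, 1, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
result = gibbs_anneal(noisy)
print(result)
```

At high temperature the two states are chosen almost uniformly, while at low temperature the update approaches the deterministic lowest-energy choice used by ICM.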

Another variant on simulated annealing is the Swendsen-Wang algorithm (Swendsen and 
Wang 1987; Barbu and Zhu 2003, 2005). Here, instead of “flipping” (changing) single variables,
a connected subset of variables, chosen using a random walk based on MRF connectivity
strengths, is selected as the basic update unit. This can sometimes help make larger
state changes, and hence find better-quality solutions in less time. 

While simulated annealing has largely been superseded by the newer graph cuts and loopy 
belief propagation techniques, it still occasionally finds use, especially in highly connected 
and highly non-submodular graphs (Rother, Kolmogorov, Lempitsky et al. 2007). 

B.5.2 Dynamic programming 

Dynamic programming (DP) is an efficient inference procedure that works for any tree- 
structured graphical model, i.e., one that does not have any cycles. Given such a tree, pick 
any node as the root r and figuratively pick up the tree by its root. The depth or distance of all 
the other nodes from this root induces a partial ordering over the vertices, from which a total 
ordering can be obtained by arbitrarily breaking ties. Let us now lay out this graph as a tree 
with the root on the right and indices increasing from left to right, as shown in Figure B.2a. 

Before describing the DP algorithm, let us re-write the potential function of Equation (B.24)
in a more general but succinct form,

$$E(x) = \sum_{(i,j) \in \mathcal{N}} V_{i,j}(x_i, x_j) + \sum_i V_i(x_i), \tag{B.26}$$

where instead of using pixel indices (i,j) and (k,l), we just use scalar index variables i
and j. We also replace the function value f(i,j) with the more succinct notation $x_i$, with
the $\{x_i\}$ variables making up the state vector x. We can simplify this function even further
by adding dummy nodes (vertices) $i^-$ for every node that has a non-zero $V_i(x_i)$ and setting
$V_{i,i^-}(x_i, x_{i^-}) = V_i(x_i)$, which lets us drop the $V_i$ terms from (B.26).

Dynamic programming proceeds by computing partial sums in a left-to-right fashion, i.e.,
in order of increasing variable index. Let $C_k$ be the children of k, i.e., $C_k = \{i : i < k, (i,k) \in \mathcal{N}\}$.
Figure B.2 Dynamic programming over a tree drawn as a factor graph. (a) To compute
the lowest energy solution $\hat{E}_k(x_k)$ at node $x_k$ conditioned on the best solutions to the left
of this node, we enumerate all possible values of $\hat{E}_i(x_i) + V_{ik}(x_i, x_k)$ and pick the smallest
one (and similarly for j). (b) For higher-order cliques, we need to try all combinations of
$(x_i, x_j)$ in order to select the best possible configuration. The arrows show the basic flow
of the computation. The lightly shaded factor $V_{ij}$ in (a) shows an additional connection that
turns the tree into a cyclic graph, for which exact inference cannot be efficiently computed.


Then, define

$$E_k(x) = \sum_{i<k,\, j \le k} V_{i,j}(x_i, x_j) = \sum_{i \in C_k} \left[ V_{i,k}(x_i, x_k) + E_i(x) \right], \tag{B.27}$$

as a partial sum of (B.26) over all variables up to and including k, i.e., over all parts of the
graph shown in Figure B.2a to the left of $x_k$. This sum depends on the state of all the unknown
variables in x with $i \le k$.

Now suppose we wish to find the setting for all variables i < k that minimizes this sum. 
It turns out that we can use a simple recursive formula 


$$\hat{E}_k(x_k) = \min_{\{x_i,\, i<k\}} E_k(x) = \sum_{i \in C_k} \min_{x_i} \left[ V_{i,k}(x_i, x_k) + \hat{E}_i(x_i) \right] \tag{B.28}$$

to find this minimum. Visually, this is easy to understand. Looking at Figure B.2a, associate
an energy $\hat{E}_k(x_k)$ with each node k and each possible setting of its value $x_k$ that is based on
the best possible setting of variables to the left of that node. It is easy to convince yourself
that in this figure, you only need to know $\hat{E}_i(x_i)$ and $\hat{E}_j(x_j)$ in order to compute this value.

Once the flow of information in the tree has been processed from left to right, the minimum
value of $\hat{E}_r(x_r)$ at the root gives the MAP (lowest-energy) solution for E(x). The
root node is set to the choice of $x_r$ that minimizes this function, and other nodes are set in a
backward chaining pass by selecting the values of child nodes $i \in C_k$ that were minimal in
the original recursion (B.28).
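
The forward recursion and the backward chaining pass can be sketched for the simplest tree, a chain in which node k-1 is the only child of node k. For readability the sketch keeps explicit unary costs instead of folding them into dummy nodes; the function names `chain_map` and `chain_energy`, the Potts pairwise cost, and the example costs are assumptions for this illustration. A brute-force search over all labelings is included only as a check.

```python
from itertools import product


def chain_map(unary, pairwise):
    """Minimize sum_k unary[k][x_k] + sum_k pairwise(x_{k-1}, x_k) over a chain."""
    n, num_labels = len(unary), len(unary[0])
    best = [list(unary[0])]  # best[k][x_k]: lowest energy of nodes 0..k given x_k
    argbest = []             # argbest[k-1][x_k]: minimizing label of node k-1
    for k in range(1, n):
        row, arg = [], []
        for xk in range(num_labels):
            costs = [best[k - 1][xi] + pairwise(xi, xk) for xi in range(num_labels)]
            xi_best = min(range(num_labels), key=costs.__getitem__)
            row.append(unary[k][xk] + costs[xi_best])
            arg.append(xi_best)
        best.append(row)
        argbest.append(arg)
    # Backward chaining pass, starting from the root (the last node).
    x = [min(range(num_labels), key=best[-1].__getitem__)]
    for arg in reversed(argbest):
        x.append(arg[x[-1]])
    x.reverse()
    return x, min(best[-1])


def chain_energy(x, unary, pairwise):
    e = sum(unary[k][xk] for k, xk in enumerate(x))
    return e + sum(pairwise(a, b) for a, b in zip(x, x[1:]))


unary = [[0, 3, 4], [2, 0, 3], [3, 2, 0], [0, 2, 5], [4, 1, 0]]
potts = lambda a, b: 0 if a == b else 2
labels, e_min = chain_map(unary, potts)

# Exhaustive check over all 3**5 labelings.
brute = min(chain_energy(x, unary, potts) for x in product(range(3), repeat=5))
print(labels, e_min, brute)
```

The recursion costs on the order of n l^2 operations for n nodes and l labels, compared with l^n for the exhaustive search.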
Dynamic programming is not restricted to trees with pairwise potentials. Figure B.2b
shows an example of a three-way potential $V_{i,j,k}(x_i, x_j, x_k)$ inside a tree. To compute the
optimum value of $\hat{E}_k(x_k)$, the recursion formula in (B.28) now has to evaluate the minimum
over all combinations of possible state values leading into a factor node (gray box).
For this reason, dynamic programming is normally exponential in complexity in the order
of the clique size, i.e., a clique of size n with l labels at each node requires the evaluation
of $l^{n-1}$ possible states (Potetz and Lee 2008; Kohli, Kumar, and Torr 2009). However, for
certain kinds of potential functions $V_{i,k}(x_i, x_k)$, including the Potts model (delta function),
absolute values (total variation), and quadratic (Gaussian MRF), Felzenszwalb and Huttenlocher
(2006) show how to reduce the complexity of the min-finding step (B.28) from $O(l^2)$
to $O(l)$. In Appendix B.5.3, we also discuss how Potetz and Lee (2008) reduce the complexity
for special kinds of higher-order clique, i.e., linear summations followed by non-linearities.
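
For the absolute value (linear) cost, the linear-time min-finding step can be sketched as a two-pass lower envelope computation, in the spirit of the distance transform approach of Felzenszwalb and Huttenlocher. The function names `min_conv_linear` and `min_conv_brute`, the slope `c`, and the sample costs are assumptions for this illustration; the brute-force version is included only for comparison.

```python
def min_conv_linear(h, c):
    """m[x_k] = min over x_i of h[x_i] + c * |x_i - x_k|, computed in O(l)."""
    m = list(h)
    for x in range(1, len(m)):            # forward pass: best value coming from the left
        m[x] = min(m[x], m[x - 1] + c)
    for x in range(len(m) - 2, -1, -1):   # backward pass: best value coming from the right
        m[x] = min(m[x], m[x + 1] + c)
    return m


def min_conv_brute(h, c):
    """Direct O(l^2) evaluation of the same minimization."""
    n = len(h)
    return [min(h[xi] + c * abs(xi - xk) for xi in range(n)) for xk in range(n)]


h = [3, 0, 4, 7, 1, 6]  # e.g., child energies plus data costs for each label
fast = min_conv_linear(h, 1.5)
slow = min_conv_brute(h, 1.5)
print(fast, slow)
```

The two passes work because a linear cost lets the best value for a label be derived from the best value of its immediate neighbor in label space.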

Figure B.2a also shows what happens if we add an extra factor between nodes i and j. 
In this case, the graph is no longer a tree, i.e., it contains a cycle. It is no longer possible 
to use the recursion formula (B.28), since $\hat{E}_i(x_i)$ now appears in two different terms inside
the summation, i.e., as a child of both nodes j and k, and the same setting for $x_i$ may not
minimize both. In other words, when loops exist, there is no ordering of the variables that 
allows the recursion (elimination) in (B.28) to be well-founded. 

It is, however, possible to convert small loops into higher-order factors and to solve these 
as shown in Figure B.2b. However, graphs with long loops or meshes result in extremely 
large clique sizes and hence an amount of computation potentially exponential in the size of 
the graph. 


B.5.3 Belief propagation 

Belief propagation is an inference technique originally developed for trees (Pearl 1988) but 
more recently extended to “loopy” (cyclic) graphs such as MRFs (Frey and MacKay 1997; 
Freeman, Pasztor, and Carmichael 2000; Yedidia, Freeman, and Weiss 2001; Weiss and Free- 
man 2001a,b; Yuille 2002; Sun, Zheng, and Shum 2003; Felzenszwalb and Huttenlocher 
2006). It is closely related to dynamic programming, in that both techniques pass messages 
forward and backward over a tree or graph. In fact, one of the two variants of belief prop- 
agation, the max-product rule, performs the exact same computation (inference) as dynamic 
programming, albeit using probabilities instead of energies. 

Recall that the energy we are minimizing in MAP estimation (B.26) is the negative log 
likelihood (B.12, B.13, and B.22) of a factored Gibbs posterior distribution, 

$$p(x) = \prod_{(i,j) \in \mathcal{N}} \phi_{i,j}(x_i, x_j), \tag{B.29}$$

where

$$\phi_{i,j}(x_i, x_j) = e^{-V_{i,j}(x_i, x_j)} \tag{B.30}$$

are the pairwise interaction potentials. We can rewrite (B.27) as

$$\tilde{p}_k(x) = \prod_{i<k,\, j \le k} \phi_{i,j}(x_i, x_j) = \prod_{i \in C_k} \tilde{p}_{i,k}(x), \tag{B.31}$$

where

$$\tilde{p}_{i,k}(x) = \phi_{i,k}(x_i, x_k)\, \tilde{p}_i(x). \tag{B.32}$$

We can therefore rewrite (B.28) as

$$\hat{p}_k(x_k) = \max_{\{x_i,\, i<k\}} \tilde{p}_k(x) = \prod_{i \in C_k} \hat{p}_{i,k}(x), \tag{B.33}$$

with

$$\hat{p}_{i,k}(x) = \max_{x_i} \phi_{i,k}(x_i, x_k)\, \hat{p}_i(x). \tag{B.34}$$


Equation (B.34) is the max update rule evaluated at all square box factors in Figure B.2a, 
while (B.33) is the product rule evaluated at the nodes. The probability distribution $\hat{p}_{i,k}(x)$
is often interpreted as a message passing information about child i to parent k and is hence
written as $m_{i,k}(x_k)$ (Yedidia, Freeman, and Weiss 2001) or $\mu_{i \to k}(x_k)$ (Bishop 2006).

The max-product rule can be used to compute the MAP estimate in a tree using the same 
kind of forward and backward sweep as in dynamic programming (which is sometimes called 
the max-sum algorithm (Bishop 2006)). An alternative rule, known as the sum-product rule, sums
over all possible values in (B.34) rather than taking the maximum, in essence computing
the expected distribution rather than the maximum likelihood distribution. This produces a
set of probability estimates that can be used to compute the marginal distributions
$b_i(x_i) = \sum_{x \setminus x_i} p(x)$ (Pearl 1988; Yedidia, Freeman, and Weiss 2001; Bishop 2006).
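
The sum-product rule can be sketched on a chain, where a forward and a backward sweep of messages yield the exact marginals. As before, explicit unary costs are kept for readability, the potentials are obtained from energies as exp(-V), and the names `chain_marginals` and `brute_marginals`, the Potts cost, and the example costs are assumptions for this illustration. The brute-force enumeration is included only as a check.

```python
import math
from itertools import product


def chain_marginals(unary, pairwise):
    """Marginals b_k(x_k) of p(x) proportional to exp(-E(x)) on a chain, via sum-product."""
    n, L = len(unary), len(unary[0])
    phi_u = [[math.exp(-u) for u in row] for row in unary]
    phi_p = [[math.exp(-pairwise(a, b)) for b in range(L)] for a in range(L)]
    fwd = [[1.0] * L for _ in range(n)]  # messages arriving from the left
    bwd = [[1.0] * L for _ in range(n)]  # messages arriving from the right
    for k in range(1, n):
        for xk in range(L):
            fwd[k][xk] = sum(fwd[k - 1][xi] * phi_u[k - 1][xi] * phi_p[xi][xk]
                             for xi in range(L))
    for k in range(n - 2, -1, -1):
        for xk in range(L):
            bwd[k][xk] = sum(bwd[k + 1][xj] * phi_u[k + 1][xj] * phi_p[xk][xj]
                             for xj in range(L))
    beliefs = []
    for k in range(n):
        b = [fwd[k][x] * phi_u[k][x] * bwd[k][x] for x in range(L)]
        z = sum(b)
        beliefs.append([v / z for v in b])  # normalize each marginal
    return beliefs


def brute_marginals(unary, pairwise):
    """Marginals by explicit summation over all labelings (exponential cost)."""
    n, L = len(unary), len(unary[0])
    marg = [[0.0] * L for _ in range(n)]
    for x in product(range(L), repeat=n):
        e = sum(unary[k][xk] for k, xk in enumerate(x))
        e += sum(pairwise(a, b) for a, b in zip(x, x[1:]))
        p = math.exp(-e)
        for k, xk in enumerate(x):
            marg[k][xk] += p
    return [[v / sum(row) for v in row] for row in marg]


unary = [[0.0, 1.0], [0.5, 0.2], [1.0, 0.0], [0.3, 0.3]]
potts = lambda a, b: 0.0 if a == b else 0.7
bp = chain_marginals(unary, potts)
exact = brute_marginals(unary, potts)
print(bp)
```

Replacing the sums in the message updates by maxima would give the max-product rule discussed above.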

Belief propagation may not produce optimal estimates for cyclic graphs for the same 
reason that dynamic programming fails to work, i.e., because a node with multiple parents 
may take on different optimal values for each of the parents, i.e., there is no unique elim- 
ination ordering. Early algorithms for extending belief propagation to graphs with cycles, 
dubbed loopy belief propagation, performed the updates in parallel over the graph, i.e., using
synchronous updates (Frey and MacKay 1997; Freeman, Pasztor, and Carmichael 2000;
Yedidia, Freeman, and Weiss 2001; Weiss and Freeman 2001a,b; Yuille 2002; Sun, Zheng, 
and Shum 2003; Felzenszwalb and Huttenlocher 2006). 

For example, Felzenszwalb and Huttenlocher (2006) split an $\mathcal{N}_4$ graph into its red and
black (checkerboard) components and alternate between sending messages from the red nodes 
to the black and vice versa. They also use multi-grid (coarser level) updates to speed up the 
convergence. As discussed previously, to reduce the complexity of the basic max-product 
update rule (B.28) from $O(l^2)$ to $O(l)$, they develop specialized update algorithms for several
cost functions $V_{i,k}(x_i, x_k)$, including the Potts model (delta function), absolute values
(total variation), and quadratic (Gaussian MRF). A related algorithm, mean field diffusion 
(Scharstein and Szeliski 1998), also uses synchronous updates between nodes to compute 
marginal distributions. Yuille (2010) discusses the relationships between mean field theory 
and loopy belief propagation. 

More recent loopy belief propagation algorithms and their variants use sequential scans 
through the graph (Szeliski, Zabih, Scharstein et al. 2008). For example, Tappen and Free- 
man (2003) pass messages from left to right along each row and then reverse the direction 
once they reach the end. This is similar to treating each row as an independent tree (chain), 
except that messages from nodes above and below the row are also incorporated. They then 
perform similar computations along columns. These sequential updates allow the information 
to propagate much more quickly across the image than synchronous updates. 

The other belief propagation variant tested by Szeliski, Zabih, Scharstein et al. (2008), 
which they call BP-S or TRW-S, is based on Kolmogorov’s (2006) sequential extension of 
the tree-reweighted message passing of Wainwright, Jaakkola, and Willsky (2005). TRW 
first selects a set of trees from the neighborhood graph and computes a set of probability 
distributions over each tree. These are then used to reweight the messages being passed 
during loopy belief propagation. The sequential version of TRW, called TRW-S, processes
nodes in scan-line order, with a forward and backward pass. In the forward pass, each node 
sends messages to its right and bottom neighbors. In the backward pass, messages are sent 
to the left and upper neighbors. TRW-S also computes a lower bound on the energy, which 
is used by Szeliski, Zabih, Scharstein et al. (2008) to estimate how close to the best possible 
solution all of the MRF inference algorithms being evaluated get. 

As with dynamic programming, belief propagation techniques also become less efficient 
as the order of each factor clique increases. Potetz and Lee (2008) show how this complexity
can be reduced back to linear in the clique order for continuous-valued problems where
the factors involve linear summations followed by a non-linearity, which is typical of more 
sophisticated MRF models such as fields of experts (Roth and Black 2009) and steerable ran- 
dom fields (Roth and Black 2007b). Kohli, Kumar, and Torr (2009) and Alahari, Kohli, and 
Torr (2011) develop alternative ways for dealing with higher-order cliques in the context of 
graph cut algorithms. 


B.5.4 Graph cuts 

The computer vision community has adopted “graph cuts” as an informal name to describe
a large family of MRF inference algorithms based on solving one or more min-cut or max-
flow problems (Boykov, Veksler, and Zabih 2001; Boykov and Kolmogorov 2010; Boykov,
Veksler, and Zabih 2010; Ishikawa and Veksler 2010).

Figure B.3 Graph cuts for minimizing binary sub-modular MRF energies (Boykov and Jolly
2001) © 2001 IEEE: (a) energy function encoded as a max flow problem; (b) the minimum
cut determines the region boundary.

The simplest example of an MRF graph cut is the polynomial-time algorithm for perform- 
ing exact minimization of a binary MRF originally developed by Greig, Porteous, and Seheult 
(1989) and brought to the attention of the computer vision community by Boykov, Veksler, 
and Zabih (2001) and Boykov and Jolly (2001). The basic construction of the min-cut graph 
from an MRF energy function is shown in Figure B.3 and described in Sections 3.7.2 and 
5.5. In brief, the nodes in an MRF are connected to special source and sink nodes, and the 
minimum cut between these two nodes, whose cost is exactly that of the MRF energy un- 
der a binary assignment of labels, is computed using a polynomial-time max flow algorithm 
(Goldberg and Tarjan 1988; Boykov and Kolmogorov 2004). 
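
The source/sink construction can be sketched for a tiny binary MRF with non-negative unary costs and a Potts interaction. The sketch uses a plain breadth-first augmenting path (Edmonds-Karp) max flow on a dense capacity matrix, which is far simpler and slower than the specialized algorithms cited above; the convention that source-side nodes receive label 0, the names `min_cut_labels` and `binary_energy`, and the example costs are assumptions for this illustration.

```python
from collections import deque
from itertools import product


def min_cut_labels(unary, edges, lam):
    """unary[i] = (cost of label 0, cost of label 1); edges = list of (i, j) pairs."""
    n = len(unary)
    s, t = n, n + 1
    cap = [[0.0] * (n + 2) for _ in range(n + 2)]
    for i, (d0, d1) in enumerate(unary):
        cap[s][i] = d1  # cut when i ends up on the sink side (label 1)
        cap[i][t] = d0  # cut when i stays on the source side (label 0)
    for i, j in edges:
        cap[i][j] += lam  # cut when the two labels differ
        cap[j][i] += lam
    flow = 0.0
    while True:
        parent = {s: None}  # breadth-first search in the residual graph
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in range(n + 2):
                if v not in parent and cap[u][v] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            break  # no augmenting path left: flow is maximal
        bottleneck, v = float("inf"), t
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck
    # Nodes still reachable from the source form the source side of the minimum cut.
    labels = [0 if i in parent else 1 for i in range(n)]
    return labels, flow


def binary_energy(labels, unary, edges, lam):
    e = sum(unary[i][x] for i, x in enumerate(labels))
    return e + sum(lam for i, j in edges if labels[i] != labels[j])


unary = [(1.0, 4.0), (3.0, 2.0), (5.0, 1.0), (2.0, 3.0)]
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]  # a 2 x 2 grid with N4 connectivity
labels, flow = min_cut_labels(unary, edges, lam=2.0)
brute = min(binary_energy(x, unary, edges, 2.0) for x in product((0, 1), repeat=4))
print(labels, flow, brute)
```

The value of the maximum flow equals the cost of the minimum cut, which in turn equals the energy of the returned labeling.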

As discussed in Section 5.5, important extensions of this basic algorithm have been made 
for the case of directed edges (Kolmogorov and Boykov 2005), larger neighborhoods (Boykov 
and Kolmogorov 2003; Kolmogorov and Boykov 2005), connectivity priors (Vicente, Kol- 
mogorov, and Rother 2008), and shape priors (Lempitsky and Boykov 2007; Lempitsky, 
Blake, and Rother 2008). Kolmogorov and Zabih (2004) formally characterize the class 
of binary energy potentials (regularity conditions) for which these algorithms find the global
minimum. Komodakis, Tziritas, and Paragios (2008) and Rother, Kolmogorov, Lempitsky et 
al. (2007) provide good algorithms for the cases when they do not. 

Binary MRF problems can also be approximately solved by turning them into continuous 
[0, 1] problems, solving them either as linear systems (Grady 2006; Sinop and Grady 2007; 
Grady and Alvino 2008; Grady 2008; Grady and Ali 2008; Singaraju, Grady, and Vidal 2008; 
Couprie, Grady, Najman et al. 2009) (the random walker model) or by computing geodesic
distances (Bai and Sapiro 2009; Criminisi, Sharp, and Blake 2008) and then thresholding the
results. More details on these techniques are provided in Section 5.5 and a nice review can 
be found in the work of Singaraju, Grady, Sinop et al. (2010). A different connection to 
continuous segmentation techniques, this time to the literature on level sets (Section 5.1.4), 
is made by Boykov, Kolmogorov, Cremers et al. (2006), who develop an approach to solving 
surface propagation PDEs based on combinatorial graph cut algorithms — Boykov and Funka- 
Lea (2006) discuss this and related techniques. 

Multi-valued MRF inference problems usually require solving a series of related binary 
MRF problems (Boykov, Veksler, and Zabih 2001), although for special cases, such as some 
convex functions, a single graph cut may suffice (Ishikawa 2003; Schlesinger and Flach 
2006). The seminal work in this area is that of Boykov, Veksler, and Zabih (2001), who intro- 
duced two algorithms, called the swap move and the expansion move, which are sketched in 
Figure B.4. The $\alpha$-$\beta$-swap move selects two labels (usually by cycling through all possible
pairings) and then formulates a binary MRF problem that allows any pixels currently labeled
as either $\alpha$ or $\beta$ to optionally switch their values to the other label. The $\alpha$-expansion move
allows any pixel in the MRF to take on the $\alpha$ label or to keep its current identity. It is easy
to see by inspection that both of these moves result in binary MRFs with well-defined energy 
functions. 
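
The structure of the expansion move can be sketched on a chain, where each pixel makes a binary choice between keeping its current label and switching to the α label. To keep the example short, the inner binary problem is solved exactly with a two-state dynamic program instead of a graph cut, which is a substitution made only for this illustration; the names `energy`, `best_expansion`, and `alpha_expansion`, the truncated linear pairwise cost, and the example costs are likewise assumptions.

```python
def energy(labels, unary, pairwise):
    e = sum(unary[i][x] for i, x in enumerate(labels))
    return e + sum(pairwise(a, b) for a, b in zip(labels, labels[1:]))


def best_expansion(labels, alpha, unary, pairwise):
    """Best labeling reachable when every node either keeps its label or takes alpha."""
    n = len(labels)
    cand = [(labels[i], alpha) for i in range(n)]  # binary choice: keep or switch
    cost = [unary[0][c] for c in cand[0]]
    back = []
    for k in range(1, n):
        new_cost, bk = [], []
        for ck in cand[k]:
            opts = [cost[b] + pairwise(cand[k - 1][b], ck) for b in (0, 1)]
            b_best = 0 if opts[0] <= opts[1] else 1
            new_cost.append(opts[b_best] + unary[k][ck])
            bk.append(b_best)
        cost = new_cost
        back.append(bk)
    b = 0 if cost[0] <= cost[1] else 1
    choice = [b]
    for bk in reversed(back):  # backtrack the binary decisions
        b = bk[b]
        choice.append(b)
    choice.reverse()
    return [cand[i][choice[i]] for i in range(n)]


def alpha_expansion(unary, pairwise, num_labels):
    labels = [0] * len(unary)
    improved = True
    while improved:  # cycle through the labels until no move lowers the energy
        improved = False
        for alpha in range(num_labels):
            proposal = best_expansion(labels, alpha, unary, pairwise)
            if energy(proposal, unary, pairwise) < energy(labels, unary, pairwise):
                labels, improved = proposal, True
    return labels


unary = [[0, 2, 5], [4, 1, 3], [5, 0, 2], [6, 3, 0], [5, 4, 0], [0, 3, 6]]
trunc_linear = lambda a, b: min(abs(a - b), 2)  # a metric, as expansion moves assume
labels = alpha_expansion(unary, trunc_linear, num_labels=3)
e_init = energy([0] * len(unary), unary, trunc_linear)
e_final = energy(labels, unary, trunc_linear)
print(labels, e_init, e_final)
```

The outer loop only accepts moves that strictly lower the energy, so it terminates at a labeling that no single expansion move can improve, which is not necessarily the global minimum.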

Because these algorithms use a binary MRF optimization inside their inner loop, they 
are subject to the constraints on the energy functions that occur in the binary labeling case 
(Kolmogorov and Zabih 2004). However, more recent algorithms such as those developed by 
Komodakis, Tziritas, and Paragios (2008) and Rother, Kolmogorov, Lempitsky et al. (2007)
can be used to provide approximate solutions for more general energy functions. Efficient
algorithms for re-using previous solutions (flow- and cut-recycling) have been developed for
on-line applications such as dynamic MRFs (Kohli and Torr 2005; Juan and Boykov 2006;
Alahari, Kohli, and Torr 2011) and coarse-to-fine banded graph cuts (Agarwala, Zheng, Pal et
al. 2005; Lombaert, Sun, Grady et al. 2005; Juan and Boykov 2006). It is also now possible to
minimize the number of labels used as part of the alpha-expansion process (Delong, Osokin, 
Isack et al. 2010). 

In experimental comparisons, $\alpha$-expansions usually converge faster to a good solution
than $\alpha$-$\beta$-swaps (Szeliski, Zabih, Scharstein et al. 2008), especially for problems that involve
large regions of identical labels, such as the labeling of source imagery in image stitching
(Figure 3.60). For truncated convex energy functions defined over ordinal values, more
accurate algorithms that consider complete ranges of labels inside each min-cut and often
produce lower energies have been developed (Veksler 2007; Kumar and Torr 2008; Kumar,
Veksler, and Torr 2010). The whole field of efficient MRF inference algorithms is rapidly
developing, as witnessed by a recent special journal issue (Kohli and Torr 2008; Komodakis,
Tziritas, and Paragios 2008; Olsson, Eriksson, and Kahl 2008; Potetz and Lee 2008), articles
(Alahari, Kohli, and Torr 2011), and a forthcoming book (Blake, Kohli, and Rother 2010).

Figure B.4 Multi-level graph optimization from (Boykov, Veksler, and Zabih 2001) © 2001
IEEE: (a) initial problem configuration; (b) the standard move changes only one pixel; (c)
the $\alpha$-$\beta$-swap optimally exchanges all $\alpha$- and $\beta$-labeled pixels; (d) the $\alpha$-expansion move
optimally selects among current pixel values and the $\alpha$ label.


B.5.5 Linear programming 8

Many successful algorithms for MRF optimization are based on the linear programming
(LP) relaxation of the energy function (Weiss, Yanover, and Meltzer 2010). For some prac- 
tical MRF problems, LP-based techniques can produce globally minimal solutions (Meltzer, 
Yanover, and Weiss 2005), even though MRF inference is in general NP-hard. In order to 
describe this relaxation, let us first rewrite the energy function (B.26) as 


$$E(x) = \sum_{(i,j) \in \mathcal{N}} V_{i,j}(x_i, x_j) + \sum_i V_i(x_i) \tag{B.35}$$
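$$\phantom{E(x)} = \sum_{i,j,\alpha,\beta} V_{i,j}(\alpha, \beta)\, x_{i,j;\alpha,\beta} + \sum_{i,\alpha} V_i(\alpha)\, x_{i;\alpha} \tag{B.36}$$

subject to

$$x_{i;\alpha} = \sum_{\beta} x_{i,j;\alpha,\beta} \quad \forall (i,j) \in \mathcal{N}, \alpha, \tag{B.37}$$

$$x_{j;\beta} = \sum_{\alpha} x_{i,j;\alpha,\beta} \quad \forall (i,j) \in \mathcal{N}, \beta, \text{ and} \tag{B.38}$$

$$x_{i;\alpha},\, x_{i,j;\alpha,\beta} \in \{0, 1\}. \tag{B.39}$$

Here, $\alpha$ and $\beta$ range over label values and $x_{i;\alpha} = \delta(x_i - \alpha)$ and $x_{i,j;\alpha,\beta} = \delta(x_i - \alpha)\delta(x_j - \beta)$
are indicator variables of assignments $x_i = \alpha$ and $(x_i, x_j) = (\alpha, \beta)$, respectively. The LP
relaxation is obtained by replacing the discreteness constraints (B.39) with linear constraints
$x_{i;\alpha}, x_{i,j;\alpha,\beta} \in [0, 1]$. It is easy to show that the optimal value of (B.36) is a lower bound on (B.26).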


8 This section was contributed by Vladimir Kolmogorov. Thanks!

This relaxation has been extensively studied in the literature, starting with the work of
Schlesinger (1976). An important question is how to solve this LP efficiently. Unfortunately, 
general-purpose LP solvers cannot handle large problems in vision (Yanover, Meltzer, and 
Weiss 2006). A large number of customized iterative techniques have been proposed. Most 
of these solve the dual problem, i.e., they formulate a lower bound on (B.36) and then try to 
maximize this bound. The bound is often formulated using a convex combination of trees, as 
proposed in (Wainwright, Jaakkola, and Willsky 2005). 

The LP lower bound can be maximized via a number of techniques, such as max-sum dif- 
fusion (Werner 2007), tree-reweighted message passing (TRW) (Wainwright, Jaakkola, and 
Willsky 2005; Kolmogorov 2006), subgradient methods (Schlesinger and Giginyak 2007a,b; 
Komodakis, Paragios, and Tziritas 2007), and Bregman projections (Ravikumar, Agarwal, 
and Wainwright 2008). Note that the max-sum diffusion and TRW algorithms are not guaranteed
to converge to a global maximum of the LP; they may get stuck at a suboptimal point
(Kolmogorov 2006; Werner 2007). However, in practice, this does not appear to be a problem 
(Kolmogorov 2006). 

For some vision applications, algorithms based on relaxation (B.36) produce excellent 
results. However, this is not guaranteed in all cases — after all, the problem is NP-hard. 
Recently, researchers have investigated alternative linear programming relaxations (Sontag 
and Jaakkola 2007; Sontag, Meltzer, Globerson et al. 2008; Komodakis and Paragios 2008; 
Schraudolph 2010). These algorithms are capable of producing tighter bounds compared to 
(B.36) at the expense of additional computational cost. 

LP relaxation and alpha expansion. Solving a linear program produces primal and dual 
solutions that satisfy complementary slackness conditions. In general, the primal solution 
of (B.36) does not have to be integer-valued so, in practice, we may have to round it to 
obtain a valid labeling x. An alternative proposed by Komodakis and Tziritas (2007a) and
Komodakis, Tziritas, and Paragios (2007) is to search for primal and dual solutions that
satisfy approximate complementary slackness conditions and whose primal solution is
already integer-valued. Several max-flow-based algorithms are proposed by Komodakis and
Tziritas (2007a) and Komodakis, Tziritas, and Paragios (2007) for this purpose and the Fast-PD
method (Komodakis, Tziritas, and Paragios 2007) is shown to perform best. In the case of
metric interactions, the default version of Fast-PD produces the same primal solution as the 
alpha-expansion algorithm (Boykov, Veksler, and Zabih 2001). This provides an interesting 
interpretation of the alpha expansion algorithm as trying to approximately solve relaxation 
(B.36). 

Unlike the standard alpha expansion algorithm, Fast-PD also maintains a dual solution 
and thus runs faster in practice. Fast-PD can be extended to the case of semi-metric
interactions (Komodakis, Tziritas, and Paragios 2007). The primal version of such an extension was
also given by Rother, Kumar, Kolmogorov et al. (2005).

B.6 Uncertainty estimation (error analysis) 

In addition to computing the most likely estimate, many applications require an estimate for 
the uncertainty in this estimate. 9 The most general way to do this is to compute a complete 
probability distribution over all of the unknowns but this is generally intractable. The one spe- 
cial case where it is easy to obtain a simple description for this distribution is linear estimation 
problems with Gaussian noise, where the joint energy function (negative log likelihood of the 
posterior estimate) is a quadratic. In this case, the posterior distribution is a multi-variate 
Gaussian and the covariance can be computed directly from the inverse of the problem Hes- 
sian. (Another name for the inverse covariance matrix, which is equal to the Hessian in such 
simple cases, is the information matrix.) 
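
As a minimal sketch of this special case, consider fitting a line y = a x + b to measurements with independent Gaussian noise of standard deviation sigma on y. The Hessian (information matrix) is then J^T J / sigma^2, where row i of J is [x_i, 1], and the 2 × 2 parameter covariance is its inverse. The function name line_fit_covariance and the choice of a line-fitting problem are illustrative assumptions, not code from the book.

```c
#include <stdio.h>

/* Covariance of the parameters (a, b) of a least-squares line fit
 * y = a*x + b, assuming i.i.d. Gaussian noise of standard deviation sigma
 * on y.  The Hessian of the negative log likelihood is
 * H = J^T J / sigma^2, where row i of J is [x_i, 1].
 * Returns 0 on success, -1 if the Hessian is singular. */
int line_fit_covariance(const double *x, int n, double sigma, double cov[2][2])
{
    double sxx = 0.0, sx = 0.0;
    for (int i = 0; i < n; i++) {
        sxx += x[i] * x[i];
        sx  += x[i];
    }

    double s2 = sigma * sigma;
    /* Entries of the symmetric 2x2 Hessian (information matrix). */
    double h00 = sxx / s2, h01 = sx / s2, h11 = n / s2;

    double det = h00 * h11 - h01 * h01;
    if (det <= 0.0)
        return -1;              /* e.g., a single sample or all x equal */

    /* Covariance = inverse of the Hessian (closed-form 2x2 inverse). */
    cov[0][0] =  h11 / det;
    cov[0][1] = -h01 / det;
    cov[1][0] = -h01 / det;
    cov[1][1] =  h00 / det;
    return 0;
}
```

For samples at x = 0, 1, 2 and sigma = 1, this gives a slope variance of 0.5 and an intercept variance of 5/6; the negative off-diagonal term reflects the usual trade-off between slope and intercept.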

Even here, however, the full covariance matrix may be too large to compute and store. For 
example, in large structure from motion problems, a large sparse Hessian normally results in a 
full dense covariance matrix. In such cases, it is often considered acceptable to report only the 
variance in the estimated quantities or simple covariance estimates on individual parameters, 
such as 3D point positions or camera pose estimates (Szeliski 1990a). More insight into the 
problem, e.g., the dominant modes of uncertainty, can be obtained using eigenvalue analysis 
(Szeliski and Kang 1997). 

For problems where the posterior energy is non-quadratic, e.g., in non-linear or robustified 
least squares, it is still often possible to obtain an estimate of the Hessian in the vicinity of the 
optimal solution. In this case, the Cramer-Rao lower bound on the uncertainty (covariance) 
can be computed as the inverse of the Hessian. Another way of saying this is that while the 
local Hessian can underestimate how “wide” the energy function can be, the covariance can 
never be smaller than the estimate based on this local quadratic approximation. It is also 
possible to estimate a different kind of uncertainty (min-marginal energies) in general MRFs 
where the MAP inference is performed using graph cuts (Kohli and Torr 2008). 

While many computer vision applications ignore uncertainty modeling, it is often useful 
to compute these estimates just to get an intuitive feeling for the reliability of the estimates. 
Certain applications, such as Kalman filtering, require the computation of this uncertainty 
(either explicitly as posterior covariances or implicitly as inverse covariances) in order to 
optimally integrate new measurements with previously computed estimates. 
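
The role of inverse covariances in this kind of integration can be seen in the simplest case: a static scalar quantity, with no dynamics model. The sketch below fuses a previous estimate with a new measurement by adding their inverse variances; fuse_scalar is a hypothetical name, and a full Kalman filter would add a prediction step and matrix-valued covariances.

```c
/* Fuse a previous estimate (x0, var0) with a new measurement (z, var_z) of
 * the same scalar quantity, assuming independent Gaussian errors.
 * Inverse variances (information) add, and the fused estimate is the
 * information-weighted average of the two inputs. */
void fuse_scalar(double x0, double var0, double z, double var_z,
                 double *x, double *var)
{
    double info0  = 1.0 / var0;
    double info_z = 1.0 / var_z;
    double info   = info0 + info_z;

    *x   = (info0 * x0 + info_z * z) / info;
    *var = 1.0 / info;
}
```

Fusing two estimates of equal variance halves the variance and averages the values, while a much more precise measurement pulls the result almost entirely toward itself.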


9 This is particularly true of classic photogrammetry applications, where the reporting of precision is almost 
always considered mandatory (Förstner 2005).
In this final appendix, I summarize some of the supplementary materials that may be
useful to students, instructors, and researchers. The book’s Web site at http://szeliski.org/Book
contains updated lists of datasets and software, so please check there as well.

C.1 Data sets

One of the keys to developing reliable vision algorithms is to test your procedures on chal- 
lenging and representative data sets. When ground truth or other people’s results are available, 
such tests can be even more informative (and quantitative).

Over the years, a large number of datasets have been developed for testing and evaluating 
computer vision algorithms. A number of these datasets (and software) are indexed on the 
Computer Vision Homepage. 1 Some newer Web sites, such as CVonline
(http://homepages.inf.ed.ac.uk/rbf/CVonline/), VisionBib.Com (http://datasets.visionbib.com/),
and Computer Vision online (http://computervisiononline.com/), have more recent pointers.

Below, I list some of the more popular data sets, grouped by the book chapters to which 
they most closely correspond: 

Chapter 2: Image formation 

CUReT: Columbia-Utrecht Reflectance and Texture Database,
http://www1.cs.columbia.edu/CAVE/software/curet/ (Dana, van Ginneken, Nayar et al. 1999).

Middlebury Color Datasets: registered color images taken by different cameras to 
study how they transform gamuts and colors, http://vision.middlebury.edu/color/data/ 
(Chakrabarti, Scharstein, and Zickler 2009). 

Chapter 3: Image processing 

Middlebury test datasets for evaluating MRF minimization/inference algorithms,
http://vision.middlebury.edu/MRF/results/ (Szeliski, Zabih, Scharstein et al. 2008).

Chapter 4: Feature detection and matching 

Affine Covariant Features database for evaluating feature detector and descriptor match- 
ing quality and repeatability, http://www.robots.ox.ac.uk/~vgg/research/affine/ (Miko- 
lajczyk and Schmid 2005; Mikolajczyk, Tuytelaars, Schmid et al. 2005). 

Database of matched image patches for learning and feature descriptor evaluation, 
http://cvlab.epfl.ch/~brown/patchdata/patchdata.html (Winder and Brown 2007; Hua, 
Brown, and Winder 2007). 

1 http://www.cs.cmu.edu/~cil/vision.html, although it has not been maintained since 2004. 




Chapter 5: Segmentation 

Berkeley Segmentation Dataset and Benchmark of 1000 images labeled by 30 humans, 
along with an evaluation, http://www.eecs.berkeley.edu/Research/Projects/CS/vision/ 
grouping/segbench/ (Martin, Fowlkes, Tal et al. 2001). 

Weizmann segmentation evaluation database of 100 grayscale images with ground 
truth segmentations, http://www.wisdom.weizmann.ac.il/~vision/Seg_Evaluation_DB/ 
index.html (Alpert, Galun, Basri et al. 2007). 

Chapter 8: Dense motion estimation 

The Middlebury optic flow evaluation Web site, http://vision.middlebury.edu/flow/data 
(Baker, Scharstein, Lewis et al. 2009). 

The Human-Assisted Motion Annotation database,
http://people.csail.mit.edu/celiu/motionAnnotation/ (Liu, Freeman, Adelson et al. 2008).

Chapter 10: Computational photography

High Dynamic Range radiance maps, http://www.debevec.org/Research/HDR/ (De- 
bevec and Malik 1997). 

Alpha matting evaluation Web site, http://alphamatting.com/ (Rhemann, Rother, Wang 
et al. 2009). 

Chapter 11: Stereo correspondence 

Middlebury Stereo Datasets and Evaluation, http://vision.middlebury.edu/stereo/ (Scharstein 
and Szeliski 2002). 

Stereo Classification and Performance Evaluation of different aggregation costs for 
stereo matching, http://www.vision.deis.unibo.it/spe/SPEHome.aspx (Tombari, Mat- 
toccia, Di Stefano et al. 2008). 

Middlebury Multi-View Stereo Datasets, http://vision.middlebury.edu/mview/data/ (Seitz, 
Curless, Diebel et al. 2006). 

Multi-view and Oxford Colleges building reconstructions, http://www.robots.ox.ac.uk/ 
~vgg/data/data-mview.html. 

Multi-View Stereo Datasets, http://cvlab.epfl.ch/data/strechamvs/ (Strecha, Fransens, 
and Van Gool 2006). 
Multi-View Evaluation, http://cvlab.epfl.ch/~strecha/multiview/ (Strecha, von Hansen,
Van Gool et al. 2008).

Chapter 12: 3D reconstruction 

HumanEva: synchronized video and motion capture dataset for evaluation of
articulated human motion, http://vision.cs.brown.edu/humaneva/ (Sigal, Balan, and Black
2010).

Chapter 13: Image-based rendering 

The (New) Stanford Light Field Archive, http://lightfield.stanford.edu/ (Wilburn, Joshi, 
Vaish et al. 2005). 

Virtual Viewpoint Video: multi-viewpoint video with per-frame depth maps,
http://research.microsoft.com/en-us/um/redmond/groups/ivm/vvv/ (Zitnick, Kang,
Uyttendaele et al. 2004).

Chapter 14: Recognition 

For a list of visual recognition datasets, see Tables 14.1-14.2. In addition to those, 
there are also: 

Buffy pose classes, http://www.robots.ox.ac.uk/~vgg/data/buffy_pose_classes/ and Buffy 
stickmen V2.1, http://www.robots.ox.ac.uk/~vgg/data/stickmen/index.html (Ferrari, Marin- 
Jimenez, and Zisserman 2009; Eichner and Ferrari 2009). 

H3D database of pose/joint annotated photographs of humans,
http://www.eecs.berkeley.edu/~lbourdev/h3d/ (Bourdev and Malik 2009).

Action Recognition Datasets, http://www.cs.berkeley.edu/projects/vision/action, has point- 
ers to several datasets for action and activity recognition, as well as some papers. The 
human action database at http://www.nada.kth.se/cvap/actions/ contains more action 
sequences. 


C.2 Software 

One of the best sources for computer vision algorithms is the Open Source Computer Vision 
(OpenCV) library (http://opencv.willowgarage.com/wiki/), which was developed by Gary 
Bradski and his colleagues at Intel and is now being maintained and extended at Willow 
Garage (Bradski and Kaehler 2008). A partial list of the available functions, taken from
http://opencv.willowgarage.com/documentation/cpp/ includes: 
• image processing and transforms (filtering, morphology, pyramids);

• geometric image transformations (rotations, resizing); 

• miscellaneous image transformations (Fourier transforms, distance transforms); 

• histograms; 

• segmentation (watershed, mean shift); 

• feature detection (Canny, Harris, Hough, MSER, SURF); 

• motion analysis and object tracking (Lucas-Kanade, mean shift); 

• camera calibration and 3D reconstruction; 

• machine learning (k nearest neighbors, support vector machines, decision trees,
boosting, random trees, expectation-maximization, and neural networks).

The Intel Performance Primitives (IPP) library, http://software.intel.com/en-us/intel-ipp/, 
contains highly optimized code for a variety of image processing tasks. Many of the routines 
in OpenCV take advantage of this library, if it is installed, to run even faster. In terms of 
functionality, it has many of the same operators as those found in OpenCV, plus additional 
libraries for image and video compression, signal and speech processing, and matrix algebra. 

The MATLAB Image Processing Toolbox, http://www.mathworks.com/products/image/, 
contains routines for spatial transformations (rotations, resizing), normalized
cross-correlation, image analysis and statistics (edges, Hough transform), image enhancement (adaptive
histogram equalization, median filtering) and restoration (deblurring), linear filtering (con- 
volution), image transforms (Fourier and DCT), and morphological operations (connected 
components and distance transforms). 

Two older libraries, which no longer appear to be under active development but contain 
many useful routines, are VXL (C++ Libraries for Computer Vision Research and Implemen- 
tation, http://vxl.sourceforge.net/) and LTI-Lib 2
(http://www.ie.itcr.ac.cr/palvarado/ltilib-2/homepage/).

Photo editing and viewing packages, such as Windows Live Photo Gallery, iPhoto, Picasa, 
GIMP, and IrfanView, can be useful for performing common processing tasks, converting
formats, and viewing your results. They can also serve as interesting reference implementations
for image processing algorithms (such as tone correction or denoising) that you are trying to 
develop from scratch. 

There are also software packages and infrastructure that can be helpful for building real- 
time video processing demos. Vision on Tap (http://www.visionontap.com/) provides a Web 
service that will process your webcam video in real time (Chiu and Raskar 2009).
VideoMan (VideoManager, http://videomanlib.sourceforge.net/) can be useful for getting real-time
video-based demos and applications running. You can also use imread in MATLAB to read 
directly from any URL, such as a webcam. 

Below, I list some additional software that can be found on the Web, grouped by the book 
chapters to which they most correspond: 

Chapter 3: Image processing 

matlabPyrTools — MATLAB source code for Laplacian pyramids, QMF/Wavelets, and 
steerable pyramids, http://www.cns.nyu.edu/~lcv/software.php (Simoncelli and Adel- 
son 1990a; Simoncelli, Freeman, Adelson et al. 1992). 

BLS-GSM image denoising, http://decsai.ugr.es/~javier/denoise/ (Portilla, Strela, Wain- 
wright et al. 2003). 

Fast bilateral filtering code, http://people.csail.mit.edu/jiawen/#code (Chen, Paris, and
Durand 2007).

C++ implementation of the fast distance transform algorithm,
http://people.cs.uchicago.edu/~pff/dt/ (Felzenszwalb and Huttenlocher 2004a).

GREYC’s Magic Image Converter, including image restoration software using
regularization and anisotropic diffusion, http://gmic.sourceforge.net/gimp.shtml (Tschumperlé
and Deriche 2005).

Chapter 4: Feature detection and matching 

VLFeat, an open and portable library of computer vision algorithms, http://vlfeat.org/ 
(Vedaldi and Fulkerson 2008). 

SiftGPU: A GPU Implementation of Scale Invariant Feature Transform (SIFT),
http://www.cs.unc.edu/~ccwu/siftgpu/ (Wu 2010).

SURF: Speeded Up Robust Features, http://www.vision.ee.ethz.ch/~surf/ (Bay, Tuyte- 
laars, and Van Gool 2006). 

FAST corner detection, http://mi.eng.cam.ac.uk/~er258/work/fast.html (Rosten and Drum- 
mond 2005, 2006). 

Linux binaries for affine region detectors and descriptors, as well as MATLAB files to 
compute repeatability and matching scores, http://www.robots.ox.ac.uk/~vgg/research/ 
affine/. 
Kanade-Lucas-Tomasi feature trackers: KLT, http://www.ces.clemson.edu/~stb/klt/
(Shi and Tomasi 1994); GPU-KLT, http://cs.unc.edu/~cmzach/opensource.html (Zach,
Gallup, and Frahm 2008); and Lucas-Kanade 20 Years On,
http://www.ri.cmu.edu/projects/project_515.html (Baker and Matthews 2004).

Chapter 5: Segmentation 

Efficient graph-based image segmentation, http://people.cs.uchicago.edu/~pff/segment/ 
(Felzenszwalb and Huttenlocher 2004b). 

EDISON, edge detection and image segmentation, http://coewww.rutgers.edu/riul/research/ 
code/EDISON/ (Meer and Georgescu 2001; Comaniciu and Meer 2002). 

Normalized cuts segmentation including intervening contours,
http://www.cis.upenn.edu/~jshi/software/ (Shi and Malik 2000; Malik, Belongie, Leung et al. 2001).

Segmentation by weighted aggregation (SWA), http://www.cs.weizmann.ac.il/~vision/ 
SWA/ (Alpert, Galun, Basri et al. 2007). 

Chapter 6: Feature-based alignment and calibration 

Non-iterative PnP algorithm, http://cvlab.epfl.ch/software/EPnP/ (Moreno-Noguer, Lep- 
etit, and Fua 2007). 

Tsai Camera Calibration Software, http://www-2.cs.cmu.edu/~rgw/TsaiCode.html (Tsai 
1987). 

Easy Camera Calibration Toolkit, http://research.microsoft.com/en-us/um/people/zhang/ 
Calib/ (Zhang 2000). 

Camera Calibration Toolbox for MATLAB,
http://www.vision.caltech.edu/bouguetj/calib_doc/; a C version is included in OpenCV.

MATLAB functions for multiple view geometry, http://www.robots.ox.ac.uk/~vgg/hzbook/ 
code/ (Hartley and Zisserman 2004). 

Chapter 7: Structure from motion 

SBA: A generic sparse bundle adjustment C/C++ package based on the Levenberg- 
Marquardt algorithm, http://www.ics.forth.gr/~lourakis/sba/ (Lourakis and Argyros 2009). 


Simple sparse bundle adjustment (SSBA), http://cs.unc.edu/~cmzach/opensource.html. 
Bundler, structure from motion for unordered image collections,
http://phototour.cs.washington.edu/bundler/ (Snavely, Seitz, and Szeliski 2006).

Chapter 8: Dense motion estimation 

Optical flow software, http://www.cs.brown.edu/~black/code.html (Black and Anan- 
dan 1996). 

Optical flow using total variation and conjugate gradient descent,
http://people.csail.mit.edu/celiu/OpticalFlow/ (Liu 2009).

TV-L1 optical flow on the GPU, http://cs.unc.edu/~cmzach/opensource.html (Zach, 
Pock, and Bischof 2007a). 

elastix: a toolbox for rigid and nonrigid registration of images, http://elastix.isi.uu.nl/ 
(Klein, Staring, and Pluim 2007). 

Deformable image registration using discrete optimization,
http://www.mrf-registration.net/deformable/index.html (Glocker, Komodakis, Tziritas et al. 2008).

Chapter 9: Image stitching 

Microsoft Research Image Compositing Editor for stitching images,
http://research.microsoft.com/en-us/um/redmond/groups/ivm/ice/.

Chapter 10: Computational photography 

HDRShop software for combining bracketed exposures into high-dynamic range radi- 
ance images, http://projects.ict.usc.edu/graphics/HDRShop/. 

Super-resolution code, http://www.robots.ox.ac.uk/~vgg/software/SR/ (Pickup 2007; 
Pickup, Capel, Roberts et al. 2007, 2009). 

Chapter 11: Stereo correspondence 

StereoMatcher, standalone C++ stereo matching code, http://vision.middlebury.edu/ 
stereo/code/ (Scharstein and Szeliski 2002). 

Patch-based multi-view stereo software (PMVS Version 2),
http://grail.cs.washington.edu/software/pmvs/ (Furukawa and Ponce 2011).

Chapter 12: 3D reconstruction 

Scanalyze: a system for aligning and merging range data, http://graphics.stanford.edu/ 
software/scanalyze/ (Curless and Levoy 1996). 
MeshLab: software for processing, editing, and visualizing unstructured 3D triangular
meshes, http://meshlab.sourceforge.net/. 

VRML viewers (various) are also a good way to visualize texture-mapped 3D models. 

Section 12.6.4: Whole body modeling and tracking 

Bayesian 3D person tracking, http://www.cs.brown.edu/~black/code.html (Sidenbladh,
Black, and Fleet 2000; Sidenbladh and Black 2003).

HumanEva: baseline code for the tracking of articulated human motion,
http://vision.cs.brown.edu/humaneva/ (Sigal, Balan, and Black 2010).

Section 14.1.1: Face detection 

Sample face detection code and evaluation tools, 
http://vision.ai.uiuc.edu/mhyang/face-detection-survey.html. 

Section 14.1.2: Pedestrian detection 

A simple object detector with boosting,
http://people.csail.mit.edu/torralba/shortCourseRLOC/boosting/boosting.html
(Hastie, Tibshirani, and Friedman 2001; Torralba, Murphy, and Freeman 2007).

Discriminatively trained deformable part models, http://people.cs.uchicago.edu/~pff/ 
latent/ (Felzenszwalb, Girshick, McAllester et al. 2010). 

Upper-body detector, http://www.robots.ox.ac.uk/~vgg/software/UpperBody/ (Ferrari, 
Marin-Jimenez, and Zisserman 2008). 

2D articulated human pose estimation software, http://www.vision.ee.ethz.ch/~calvin/ 
articulated_human_pose_estimation_code/ (Eichner and Ferrari 2009). 

Section 14.2.2: Active appearance and 3D shape models 

AAMtools: An active appearance modeling toolbox, http://cvsp.cs.ntua.gr/software/ 
AAMtools/ (Papandreou and Maragos 2008). 

Section 14.3: Instance recognition 

FASTANN and FASTCLUSTER for approximate k-means (AKM),
http://www.robots.ox.ac.uk/~vgg/software/ (Philbin, Chum, Isard et al. 2007).

Feature matching using fast approximate nearest neighbors,
http://people.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN (Muja and Lowe 2009).
Section 14.4.1: Bag of words

Two bag of words classifiers, http://people.csail.mit.edu/fergus/iccv2005/bagwords.html 
(Fei-Fei and Perona 2005; Sivic, Russell, Efros et al. 2005). 

Bag of features and hierarchical k-means, http://www.vlfeat.org/ (Nister and Stewenius 
2006; Nowak, Jurie, and Triggs 2006). 

Section 14.4.2: Part-based models 

A simple parts and structure object detector,
http://people.csail.mit.edu/fergus/iccv2005/partsstructure.html
(Fischler and Elschlager 1973; Felzenszwalb and Huttenlocher 2005).

Section 14.5.1: Machine learning software 

Support vector machines (SVM) software
(http://www.support-vector-machines.org/SVM_soft.html) has pointers to lots of SVM
libraries, including SVMlight, http://svmlight.joachims.org/; LIBSVM,
http://www.csie.ntu.edu.tw/~cjlin/libsvm/ (Fan, Chen, and Lin 2005); and LIBLINEAR,
http://www.csie.ntu.edu.tw/~cjlin/liblinear/ (Fan, Chang, Hsieh et al. 2008).

Kernel Machines: links to SVM, Gaussian processes, boosting, and other machine 
learning algorithms, http://www.kernel-machines.org/software. 

Multiple kernels for image classification, http://www.robots.ox.ac.uk/~vgg/software/ 
MKL/ (Varma and Ray 2007; Vedaldi, Gulshan, Varma et al. 2009). 

Appendix A.1-A.2: Matrix decompositions and linear least squares 2 

BLAS (Basic Linear Algebra Subprograms), http://www.netlib.org/blas/ (Blackford, 
Demmel, Dongarra et al. 2002). 

LAPACK (Linear Algebra PACKage), http://www.netlib.org/lapack/ (Anderson, Bai, 
Bischof et al. 1999). 

GotoBLAS, http://www.tacc.utexas.edu/tacc-projects/. 

ATLAS (Automatically Tuned Linear Algebra Software),
http://math-atlas.sourceforge.net/ (Demmel, Dongarra, Eijkhout et al. 2005).

Intel Math Kernel Library (MKL), http://software.intel.com/en-us/intel-mkl/. 

AMD Core Math Library (ACML),
http://developer.amd.com/cpu/Libraries/acml/Pages/default.aspx.

2 Thanks to Sameer Agarwal for suggesting and describing most of these sites. 
Robust PCA code, http://www.salle.url.edu/~ftorre/papers/rpca2.html (De la Torre and
Black 2003).

Appendix A.3: Non-linear least squares

MINPACK, http://www.netlib.org/minpack/. 

levmar: Levenberg-Marquardt nonlinear least squares algorithms,
http://www.ics.forth.gr/~lourakis/levmar/ (Madsen, Nielsen, and Tingleff 2004).

Appendix A.4-A.5: Direct and iterative sparse matrix solvers 

SuiteSparse (various reordering algorithms, CHOLMOD) and SuiteSparse QR,
http://www.cise.ufl.edu/research/sparse/SuiteSparse/ (Davis 2006, 2008).

PARDISO (iterative and sparse direct solution), http://www.pardiso-project.org/. 

TAUCS (sparse direct, iterative, out of core, preconditioners), http://www.tau.ac.il/ 
~stoledo/taucs/. 

HSL Mathematical Software Library, http://www.hsl.rl.ac.uk/index.html. 

Templates for the solution of linear systems,
http://www.netlib.org/linalg/html_templates/Templates.html (Barrett, Berry, Chan et al. 1994).
Download the PDF for instructions on how to get the software.

ITSOL, MIQR, and other sparse solvers, http://www-users.cs.umn.edu/~saad/software/ 
(Saad 2003). 

ILUPACK, http://www-public.tu-bs.de/~bolle/ilupack/. 

Appendix B: Bayesian modeling and inference 

Middlebury source code for MRF minimization, http://vision.middlebury.edu/MRF/ 
code/ (Szeliski, Zabih, Scharstein et al. 2008). 

C++ code for efficient belief propagation for early vision,
http://people.cs.uchicago.edu/~pff/bp/ (Felzenszwalb and Huttenlocher 2006).

FastPD MRF optimization code, http://www.csd.uoc.gr/~komod/FastPD (Komodakis
and Tziritas 2007a; Komodakis, Tziritas, and Paragios 2008).

    #include <math.h>
    #include <stdlib.h>

    /* Uniform random number in [0, 1]. */
    double urand()
    {
        return ((double) rand()) / ((double) RAND_MAX);
    }

    /* Pair of independent unit-variance Gaussian samples (Box-Muller).
       The reference parameters require compiling this routine as C++. */
    void grand(double& g1, double& g2)
    {
    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif // M_PI
        double n1 = urand();
        double n2 = urand();
        double x1 = n1 + (n1 == 0); /* guard against log(0) */
        double sqlogn1 = sqrt(-2.0 * log(x1));
        double angl = (2.0 * M_PI) * n2;
        g1 = sqlogn1 * cos(angl);
        g2 = sqlogn1 * sin(angl);
    }


Algorithm C.1 C algorithm for Gaussian random noise generation, using the Box-Muller
transform.

Gaussian noise generation. A lot of basic software packages come with a uniform random
noise generator (e.g., the rand() routine in Unix), but not all have a Gaussian random
noise generator. To compute a normally distributed random variable, you can use the
Box-Muller transform (Box and Muller 1958), whose C code is given in Algorithm C.1; note that
this routine returns pairs of random variables. Alternative methods for generating Gaussian
random numbers are given by Thomas, Luk, Leong et al. (2007).


Pseudocolor generation. In many applications, it is convenient to be able to visualize the 
set of labels assigned to an image (or to image features such as lines). One of the easiest 
ways to do this is to assign a unique color to each integer label. In my work, I have found it 
convenient to distribute these labels in a quasi-uniform fashion around the RGB color cube 
using the following idea. 

For each (non-negative) label value, consider the bits as being split among the three color 
channels, e.g., for a nine-bit value, the bits could be labeled RGB RGB RGB. After collecting
each of the three color values, reverse the bits so that the low-order bits vary the most quickly.
In practice, for eight-bit color channels, this bit reverse can be stored in a table or a complete
table mapping from labels to pseudocolors (say with 4096 entries) can be pre-computed.
Figure 8.16 shows an example of such a pseudo-color mapping. 
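
One reading of this scheme can be sketched in C as follows. The function name label_to_pseudocolor is hypothetical, and the assignment of label bit 0 to red, bit 1 to green, and bit 2 to blue is an arbitrary assumption; the collection and bit reversal are folded into a single step by writing each label bit directly into the mirrored position of its eight-bit channel.

```c
#include <stdint.h>

/* Map a non-negative integer label to an RGB pseudocolor.
 * Label bit k is assigned to channel k % 3 (0 = R, 1 = G, 2 = B; this
 * ordering is an arbitrary choice) and stored at bit 7 - k / 3 of that
 * channel, so that low-order label bits land in high-order color bits.
 * Only the low 24 bits of the label are used. */
void label_to_pseudocolor(uint32_t label, uint8_t rgb[3])
{
    rgb[0] = rgb[1] = rgb[2] = 0;
    for (int k = 0; k < 24 && (label >> k) != 0; k++) {
        if ((label >> k) & 1u)
            rgb[k % 3] |= (uint8_t)(0x80u >> (k / 3));
    }
}
```

With this mapping, labels 1, 2, and 4 become half-intensity red, green, and blue, respectively, so that small consecutive labels receive clearly distinguishable colors; a lookup table can be filled by calling the function once per label.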

GPU implementation 

The advent of programmable GPUs with capabilities such as pixel shaders and compute 
shaders has led to the development of fast computer vision algorithms for real-time
applications such as segmentation, tracking, stereo, and motion estimation (Pock, Unger, Cremers
et al. 2008; Vineet and Narayanan 2008; Zach, Gallup, and Frahm 2008). A good source
for learning about such algorithms is the CVPR 2008 workshop on Visual Computer Vision
on GPUs (CVGPU), http://www.cs.unc.edu/~jmf/Workshop_on_Computer_Vision_on_GPU.html,
whose papers can be found on the CVPR 2008 proceedings DVD. Additional sources
for GPU algorithms include the GPGPU Web site and workshops, http://gpgpu.org/, and the 
OpenVIDIA Web site, http://openvidia.sourceforge.net/index.php/OpenVIDIA. 

C.3 Slides and lectures 

As I mentioned in the preface, I hope to post slides corresponding to the material in the book. 
Until these are ready, your best bet is to look at the slides from the courses I have co-taught 
at the University of Washington, as well as related courses that have used a similar syllabus. 
Here is a partial list of such courses: 

UW 455: Undergraduate Computer Vision, http://www.cs.washington.edu/education/ 
courses/455/. 

UW 576: Graduate Computer Vision, http://www.cs.washington.edu/education/courses/ 
576/. 

Stanford CS223B: Introduction to Computer Vision,
http://vision.stanford.edu/teaching/cs223b/.

MIT 6.869: Advances in Computer Vision,
http://people.csail.mit.edu/torralba/courses/6.869/6.869.computervision.htm.

Berkeley CS 280: Computer Vision, http://www.eecs.berkeley.edu/~trevor/CS280.html. 

UNC COMP 776: Computer Vision, http://www.cs.unc.edu/~lazebnik/spring10/.

Middlebury CS 453: Computer Vision,
http://www.cs.middlebury.edu/~schar/courses/cs453-s10/.
Related courses have also been taught on the topic of Computational Photography, e.g.,

CMU 15-463: Computational Photography, http://graphics.cs.cmu.edu/courses/15-463/. 

MIT 6.815/6.865: Advanced Computational Photography,
http://stellar.mit.edu/S/course/6/sp09/6.815/.

Stanford CS 448A: Computational photography on cell phones,
http://graphics.stanford.edu/courses/cs448a-10/.

SIGGRAPH courses on Computational Photography, http://web.media.mit.edu/~raskar/ 
photo/. 

There is also an excellent set of on-line lectures available on a range of computer vision
topics, such as belief propagation and graph cuts, at the UW-MSR Course of Vision
Algorithms, http://www.cs.washington.edu/education/courses/577/04sp/.

C.4 Bibliography 

While a bibliography (BibTeX .bib file) for all of the references cited in this book is
available on the book’s Web site, a much more comprehensive partially annotated bibliography
of nearly all computer vision publications is maintained by Keith Price at
http://iris.usc.edu/Vision-Notes/bibliography/contents.html. There is also a searchable computer
graphics bibliography at http://www.siggraph.org/publications/bibliography/. Additional good
sources for technical papers are Google Scholar and CiteSeerX.
3D Rotations, see Rotations
3D alignment, 320 

absolute orientation, 320, 588 
orthogonal Procrustes, 320 
3D photography, 613 
3D video, 643 

Absolute orientation, 320, 588 
Active appearance model (AAM), 680 
Active contours, 270 
Active illumination, 585 
Active rangefinding, 585 
Active shape model (ASM), 276, 680 
Activity recognition, 610 
Adaptive smoothing, 127 
Affine transforms, 37, 40 
Affinities (segmentation), 296 
normalizing, 297 
Algebraic multigrid, 288 
Algorithms 

testing, viii 
Aliasing, 77, 476 
Alignment, see Image alignment 
Alpha 

opacity, 106 
pre-multiplied, 106 
Alpha matte, 105 


Ambient illumination, 65 
Analog to digital conversion (ADC), 77 
Anisotropic diffusion, 127 
Anisotropic filtering, 168 
Anti-aliasing filter, 78, 476 
Aperture, 69 
Aperture problem, 394 
Applications, 5 

3D model reconstruction, 362, 371

3D photography, 613 

augmented reality, 326, 368 

automotive safety, 5 

background replacement, 558 

biometrics, 668 

colorization, 504 

de-interlacing, 415 

digital heritage, 590 

document scanning, 432 

edge editing, 249 

facial animation, 603 

flash photography, 494 

frame interpolation, 418 

gaze correction, 552 

head tracking, 551

hole filling, 521 

image restoration, 192 

image search, 717 
industrial, 7

intelligent photo editing, 709 
Internet photos, 371
location recognition, 693 
machine inspection, 5 
match move, 368 
medical imaging, 5, 304, 408 
morphing, 173 

mosaic-based video compression, 436 
non-photorealistic rendering, 522 
Optical character recognition (OCR), 5 
panography, 314 

performance-driven animation, 237 

photo pop-up, 710 

Photo Tourism, 624 

Photomontage, 459 

planar pattern tracking, 326 

rotoscoping, 282 

scene completion, 709 

scratch removal, 521 

single view reconstruction, 331 

tonal adjustment, 111

video denoising, 414 

video stabilization, 401 

video summarization, 436 

video-based walkthroughs, 645 

VideoMouse, 326 

view morphing, 357 

visual effects, 5 

whiteboard scanning, 432 

z-keying, 558 

Arc length parameterization of a curve, 246 

Architectural reconstruction, 598 

Area statistics, 132 

mean (centroid), 132 
perimeter, 132 

second moment (inertia), 132 

Aspect ratio, 52, 53 


Augmented reality, 326, 338, 368 
Auto-calibration, 355 
Automatic gain control (AGC), 76 
Axis/angle representation of rotations, 41 

B-snake, 273 

B-spline, 171, 172, 250, 273, 279, 408 
cubic, 146 
multilevel, 592 
octree, 597 

Background plate, 518 
Background subtraction (maintenance), 606 
Bag of words (keypoints), 697, 727 
distance metrics, 698 
Band-pass filter, 118 
Bartlett filter, see Bilinear kernel 
Bayer pattern (RGB sensor mosaic), 85 
demosaicing, 86, 502 
Bayes’ rule, 141, 180, 762 

MAP (maximum a posteriori) estimate, 
763 

posterior distribution, 762 
Bayesian modeling, 180, 762 
MAP estimate, 180, 763 
matting, 510 

posterior distribution, 180, 762 
prior distribution, 180, 762 
uncertainty, 180 

Belief propagation (BP), 185, 768 
update rule, 769 
Bias, 104, 386 

Bidirectional Reflectance Distribution Function, see BRDF
Bilateral filter, 125 
joint, 496 
range kernel, 125 
tone mapping, 489 
Bilinear blending, 110 
Bilinear kernel, 117
Biometrics, 668 
Bipartite problem, 364 
Blind image deconvolution, 498 
Block-based motion estimation 
(block matching), 387 
Blocks world, 11

Blue screen matting, 106, 195, 507 
Blur kernel, 69 

estimation, 476, 528 
Blur removal, 144, 197 
Body color, 63 

Boltzmann distribution, 181, 763
Boosting, 663 

AdaBoost algorithm, 665 
decision stump, 663 
weak learner, 663 

Border (boundary) effects, 114, 196 

Boundary detection, 244 

Box filter, 117 

Boxlet, 121 

BRDF, 62 

anisotropic, 62 
isotropic, 62 
recovery, 612 

spatially varying (SVBRDF), 612 
Brightness, 104 
Brightness constancy, 3, 384 
Brightness constancy constraint, 384, 393, 
410 

Bundle adjustment, 363 

Calibration, see Camera calibration 
Calibration matrix, 51
Camera calibration, 50, 97 
accuracy, 340 
aliasing, 476 
extrinsic, 51, 321 


intrinsic, 50, 327 
optical blur, 476, 528 
patterns, 327 
photometric, 470 
plumb-line method, 335, 341 
point spread function, 476, 528 
radial distortion, 334 
radiometric, 470, 481, 526 
rotational motion, 332, 339 
slant edge, 476 
vanishing points, 329 
vignetting, 474 
Camera matrix, 51, 54 
Catadioptric optics, 71 
Category-level recognition, 696 
bag of words, 697, 727 
data sets, 718 
part-based, 701
segmentation, 704 
surveys, 723 
CCD, 74 

blooming, 74 
Central difference, 118 
Chained transformations, 325, 364 
Chamfer matching, 129 
Characteristic function, 131, 281, 590, 596 
Characteristic polynomial, 740 
Chirality, 347, 351 
Cholesky factorization, 741 
algorithm, 741 
incomplete, 752 
sparse, 749 

Chromatic aberration, 71, 342 
Chromaticity coordinates, 83 
CIE L*a*b*, see Color 
CIE L*u*v*, see Color 
CIE XYZ, see Color 
Circle of confusion, 69 




Computer Vision: Algorithms and Applications (September 3, 2010 draft) 


CLAHE, see Histogram equalization 
Clustering 

agglomerative, 286 
cluster analysis, 269, 305 
divisive, 286 
CMOS, 74 
Co-vector, 37 
Coefficient matrix, 177 
Collineation, 40 
Color, 80 

balance, 86, 97, 194 
camera, 84 
demosaicing, 86, 502 
fringing, 504 

hue, saturation, value (HSV), 90 

L*a*b*, 83 

L*u*v*, 84, 289

primaries, 81 

profile, 473 

ratios, 90 

RGB, 81 

transform, 104 

twist, 86, 105 

XYZ, 81 

YIQ, 88 

YUV, 88 

Color filter array (CFA), 85, 502 
Color line model, 513 
ColorChecker chart, 473 
Colorization, 504 
Compositing, 105, 192, 195 
image stitching, 450 
opacity, 106 
over operator, 106 
surface, 451
transparency, 106 
Compression, 90 
Computational photography, 467 


active illumination, 496 
flash and non-flash, 494 
high dynamic range, 479 
references, 469, 524 
tone mapping, 487 
Concentric mosaic, 437, 634 
CONDENSATION, 279 
Condition number, 750 
Conditional random field (CRF), 188, 553, 
708 

Confusion matrix (table), 226 
Conic section, 33 

Conjugate gradient descent (CG), 749 
algorithm, 751 
non-linear, 750 
preconditioned, 751 
Connected components, 131, 198 
Constellation model, 704 
Content based image retrieval (CBIR), 717 
Continuation method, 179 
Contour 

arc length parameterization, 246 
chain code, 246 
matching, 248, 263 
smoothing, 248 
Contrast, 104 

Controlled-continuity spline, 175 
Convolution, 112 
kernel, 111
mask, 111
superposition, 112 
Coring, 153, 201 
Correlation, 111, 386 
windowed, 390 
Correspondence map, 398 
Cramér-Rao lower bound, 320, 397, 775
Cube map 

Hough transform, 253 
image stitching, 451
Curve 

arc length parameterization, 246 
evolution, 248 
matching, 248 
smoothing, 248 
Cylindrical coordinates, 438 

Data energy (term), 181, 763 
Data sets and test databases, 778 
recognition, 718 
De-interlacing, 415 
Decimation, 148 
Decimation kernels 
bicubic, 150 
binomial, 148, 150 
QMF, 150 
windowed sinc, 148
Demosaicing (Bayer), 86, 502 
Depth from defocus, 584 
Depth map, see Disparity map 
Depth of field, 69, 95 
Di-chromatic reflection model, 67 
Difference matting (keying), 106, 195, 508, 
606 

Difference of Gaussians (DoG), 152 
Difference of low-pass (DOLP), 152 
Diffuse reflection, 63 
Diffusion 

anisotropic, 127 
Digital camera, 73 
color, 84 

color filter array (CFA), 85 
compression, 90 
Direct current (DC), 92 
Direct linear transform (DLT), 322 
Direct sparse matrix techniques, 747 
Directional derivative, 119 


selectivity, 120 

Discrete cosine transform (DCT), 91, 142 
Discrete Fourier transform (DFT), 134 
Discriminative random field (DRF), 190 
Disparity, 49, 539 
Disparity map, 540, 562 
multiple, 561 

Disparity space image (DSI), 540 
generalized, 542 

Displaced frame difference (DFD), 384 
Displacement field, 170 
Distance from face space (DFFS), 672 
Distance in face space (DIFS), 672 
Distance map, see Distance transform 
Distance transform, 129, 198 
Euclidean, 129 
image stitching, 455 
Manhattan (city block), 129 
signed, 130 

Domain (of a function), 103 
Domain scaling law, 167 
Downsampling, see Decimation 
Dynamic programming (DP), 554, 766 
monotonicity, 556 
ordering constraint, 556 
scanline optimization, 556 
Dynamic snake, 276 
Dynamic texture, 642 

Earth mover’s distance (EMD), 698 
Edge detection, 238, 261 
boundary detection, 244 
Canny, 239 
chain code, 246 
color, 243 

Difference of Gaussian, 240 
edgel (edge element), 241 
hysteresis, 246 
Laplacian of Gaussian, 240
linking, 244, 262 
marching cubes, 241 
scale selection, 242 
steerable filter, 241 
zero crossing, 241 
Eigenface, 671 

Eigenvalue decomposition, 275, 671, 737 
Eigenvector, 737 
Elastic deformations, 408 
image registration, 408 
Elastic nets, 272 

Elliptical weighted average (EWA), 168 
Environment map, 61, 633 
Environment matte, 634 
Epanechnikov kernel, 294 
Epipolar constraint, 348 
Epipolar geometry, 348, 537 
pure rotation, 353 
pure translation, 352 
Epipolar line, 537 
Epipolar plane, 537, 544 
image (EPI), 559, 629 
Epipolar volume, 629 
Epipole, 348, 537 
Error rates 

accuracy (ACC), 229 
false negative (FN), 226 
false positive (FP), 226 
positive predictive value (PPV), 229 
precision, 229 
recall, 229 
ROC curve, 229 
true negative (TN), 226 
true positive (TP), 226 
Errors-in-variable model, 442, 744 
heteroscedastic, 746 
Essential matrix, 348 


5-point algorithm, 352 
eight-point algorithm, 349 
re-normalization, 350 
seven-point algorithm, 350 
twisted pair, 351 
Estimation theory, 757 
Euclidean transformation, 36, 40 
Euler angles, 41 

Expectation maximization (EM), 291 
Exponential twist, 43 
Exposure bracketing, 480 
Exposure value (EV), 70, 470 

F-number (stop), 69, 95 
Face detection, 658 
boosting, 663 
cascade of classifiers, 664 
clustering and PCA, 660 
data sets, 718 
neural networks, 661 
support vector machines, 662 
Face modeling, 601 
Face recognition, 668 

active appearance model, 680 
data sets, 718 
eigenface, 671

elastic bunch graph matching, 679 
local binary patterns (LBP), 722 
local feature analysis, 679 
Face transfer, 639 

Facial motion capture, 603, 605, 639 
Factor graph, 181, 764, 768 
Factorization, 15, 357 
missing data, 360 
projective, 360 

Fast Fourier transform (FFT), 134 
Fast marching method (FMM), 282 
Feature descriptor, 222, 260 
bias and gain normalization, 222
GLOH, 223 
patch, 222 
PCA-SIFT, 223 
performance (evaluation), 224 
quantization, 234, 691, 698 
SIFT, 223 
steerable filter, 224 
Feature detection, 207, 209, 259 

Adaptive non-maximal suppression, 215 
affine invariance, 219 
auto-correlation, 210 
Förstner, 212
Harris, 212 

Laplacian of Gaussian, 217 
MSER, 220 
region, 221 
repeatability, 215 
rotation invariance, 218 
scale invariance, 216 
Feature matching, 207, 225, 261 
densification, 234 
efficiency, 232 
error rates, 226 
hashing, 232 
indexing structure, 232 
k-d trees, 233 

locality sensitive hashing, 233 
nearest neighbor, 229 
strategy, 226 
verification, 234 
Feature tracking, 235, 261 
affine, 235 
learning, 236 
Feature tracks, 357, 371 
Feature-based alignment, 311 
2D, 311 
3D, 320 


iterative, 315 
Jacobian, 312 
least squares, 312 
match verification, 686 
RANSAC, 318 
robust, 318 

Field of Experts (FoE), 186 
Fill factor, 75 
Fill-in, 366, 748 
Filter 

adaptive, 127 
band-pass, 118 
bilateral, 125 
directional derivative, 119 
edge-preserving, 124, 127 
Laplacian of Gaussian, 119
median, 124 
moving average, 117 
non-linear, 122 
separable, 115, 197 
steerable, 119, 198 
Filter coefficients, 111
Filter kernel, see Kernel 
Finding faces, see Face detection 
Finite element analysis, 176 
stiffness matrix, 177 

Finite impulse response (FIR) filter, 111, 122
Fisher information matrix, 313, 320, 758, 
775 

Fisher’s linear discriminant (FLD), 676 
Fisheye lens, 59 

Flash and non-flash merging, 494 
Flash matting, 517 
Flip-book animation, 336 
Flying spot scanner, 587 
Focal length, 52, 53, 69 
Focus, 69 

shape-from, 584, 616 
Focus of expansion (FOE), 352
Form factor, 68 

Forward mapping, see Forward warping 
Forward warping, 164, 202 
Fourier transform, 132, 198 
discrete, 134 
examples, 136 
magnitude (gain), 133 
pairs, 136 

Parseval’s Theorem, 136 
phase (shift), 133 
power spectrum, 140 
properties, 134 
two-dimensional, 140 
Fourier-based motion estimation, 388 
rotations and scale, 391 
Frame interpolation, 418 
Free-viewpoint video, 644 
Fundamental matrix, 353 

estimation, see Essential matrix 
Fundamental radiometric relation, 73 

Gain, 104, 386 
Gamma, 104 

Gamma correction, 87, 96 
Gap closing (image stitching), 435 
Garbage matte, 518 
Gaussian kernel, 117 

Gaussian Markov random field (GMRF), 
184, 191, 499

Gaussian mixtures, see Mixture of Gaussians 
Gaussian pyramid, 150 
Gaussian scale mixtures (GSM), 186 
Gaze correction, 552 
Geman-McClure function, 385 
Generalized cylinders, 12, 588, 593 
Geodesic active contour, 282 
Geodesic distance (segmentation), 304 


Geometric image formation, 31
Geometric lens aberrations, 70 
Geometric primitives, 32 

homogeneous coordinates, 32 
lines, 32, 34 
normal vector, 32 
normal vectors, 34 
planes, 33 
points, 32, 33 
Geometric transformations 
2D, 35, 163 
3D, 39 

3D perspective, 40 
3D rotations, 41 
affine, 37, 40 
bilinear, 39 
calibration matrix, 51
collineation, 40 
Euclidean, 36, 40 
forward warping, 164, 202 
hierarchy, 37 

homography, 37, 40, 56, 431
inverse warping, 165 
perspective, 37 
projections, 46 
projective, 37 
rigid body, 36, 40 
scaled rotation, 36, 40 
similarity, 36, 40 
translation, 36, 39 
Geometry image, 594 
Gesture recognition, 605 
Gibbs distribution, 181, 763 
Gibbs sampler, 765 
Gimbal lock, 41 
Gist (of a scene), 709, 714 
Global illumination, 67 
Global optimization, 174 
GPU algorithms, 789

Gradient location-orientation histogram 
(GLOH), 223 

Graduated non-convexity (GNC), 179 
Graph cuts 

MRF inference, 183, 770 
normalized cuts, 296 
Graph-based segmentation, 286 
Grassfire transform, 130, 248, 455 
Ground control points, 350, 429 

Hammersley-Clifford theorem, 181, 763 
Hann window, 138 

Harris corner detector, see Feature detection 
Head tracking, 551

active appearance model (AAM), 680 
Helmholtz reciprocity, 62 
Hessian, 177, 213, 313, 315, 320, 394, 399, 
743 

eigenvalues, 397 
image, 394, 411 
inverse, 320, 397, 401 
local, 410 
patch-based, 400 
rank-deficient, 370 
reduced motion, 366 
sparse, 366, 379, 747 
Heteroscedastic, 313, 746 
Hidden Markov model (HMM), 642 
Hierarchical motion estimation, 387 
High dynamic range (HDR) imaging, 479 
formats, 486 
tone mapping, 487 
Highest confidence first, 765 
Highest confidence first (HCF), 182 
Hilbert transform pair, 120 
Histogram equalization, 107, 196 
locally adaptive, 109, 196 


Histogram intersection, 698 

Histogram of oriented gradients (HOG), 666 

History of computer vision, 10 

Hole filling, 521 

Homogeneous coordinates, 32, 347 
Homography, 37, 56, 431
Hough transform, 251, 264 
cascaded, 253 
cube map, 253 
generalized, 251

Human body shape modeling, 609 
Human motion tracking, 605 
activity recognition, 610 
adaptive shape modeling, 609 
background subtraction, 606 
flow-based, 607 
initialization, 607 
kinematic models, 607 
particle filtering, 608 
probabilistic models, 608 
Hyper-Laplacian, 179, 184, 186 

Ideal points, 32 

Ill-posed (ill-conditioned) problems, 175 
Illusions, 3 
Image alignment 

feature-based, 311, 543 
intensity-based, 384 
intensity-based vs. feature-based, 450 
Image analogies, 522 
Image blending 
feathering, 455 
GIST, 461 

gradient domain, 459 
image stitching, 453 
Poisson, 460 
pyramid, 160, 459 

Image compositing, see Compositing 
Image compression, 90
Image decimation, 148 
Image deconvolution, see Blur removal 
Image filtering, 111
Image formation 
geometric, 31 
photometric, 60 
Image gradient, 119, 127, 392 
constraint, 177 
Image interpolation, 145 
Image matting, 505, 529 
Image processing, 101 
textbooks, 101, 192 
Image pyramid, 144, 200 
Image resampling, 163, 199 
test images, 200 
Image restoration, 144, 192 

blur removal, 144, 197, 199 
deblocking, 204 
inpainting, 192 
noise removal, 144, 197, 203 
using MRFs, 192 
Image search, 717 

Image segmentation, see Segmentation 
Image sensing, see Sensing 
Image statistics, 132 
Image stitching, 427 
automatic, 446 
bundle adjustment, 441 
compositing, 450 
coordinate transformations, 452 
cube map, 451
cylindrical, 438, 463 
de-ghosting, 446, 456, 464 
direct vs. feature-based, 450 
exposure compensation, 462 
feathering, 455 
gap closing, 435 


global alignment, 441 
homography, 431
motion models, 430 
panography, 314 
parallax removal, 445 
photogrammetry, 429 
pixel selection, 453 
planar perspective motion, 431
recognizing panoramas, 446 
rotational motion, 433 
seam selection, 456 
spherical, 439 
up vector selection, 444 
Image warping, 163, 201, 388 
Image-based modeling, 623 
Image-based rendering, 619 
concentric mosaic, 634 
environment matte, 634 
impostors, 626 
layered depth image, 626 
layers, 626 
light field, 628 
Lumigraph, 628 

modeling vs. rendering continuum, 637 
sprites, 626 
surface light field, 632 
unstructured Lumigraph, 632 
view interpolation, 621 
view-dependent texture maps, 623 
Image-based visual hull, 569 
ImageNet, 716 
Implicit surface, 596 
Impostors, see Sprites 
Impulse response, 112 
Incremental refinement 

motion estimation, 388, 392 
Incremental rotation, 45 
Indexing structure, 232 
Indicator function, 596
Industrial applications, 7 
Infinite impulse response (IIR) filter, 122
Influence function, 179, 318, 761 
Information matrix, 313, 320, 370, 758, 775 
Inpainting, 521 
Instance recognition, 685 
algorithm, 690 
data sets, 718 
geometric alignment, 686 
inverted index, 687 
large scale, 687 
match verification, 686 
query expansion, 692 
stop list, 689 
visual words, 688 
vocabulary tree, 691 
Integrability constraint, 581 
Integral image, 120 
Integrating sphere, 472 
Intelligent scissors, 280 
Interaction potential, 181, 763, 768 
Interactive computer vision, 614 
International Color Consortium (ICC), 473 
Internet photos, 371
Interpolation, 145 
Interpolation kernels 
bicubic, 146 
bilinear, 145 
binomial, 145 
sinc, 148
spline, 148 

Intrinsic camera calibration, 327 
Intrinsic images, 12 
Inverse kinematics (IK), 607 
Inverse mapping, see Inverse warping 
Inverse problems, 3, 175 
Inverse warping, 165 


ISO setting, 76 

Iterated closest point (ICP), 272, 321, 588 
Iterated conditional modes (ICM), 182, 765 
Iterative back projection (IBP), 499 
Iterative feature-based alignment, 315 
Iterative sparse matrix techniques, 748 
conjugate gradient, 749 
Iteratively reweighted least squares 
(IRLS), 318, 324, 398, 761

Jacobian, 312, 325, 364, 392, 746 
image, 394 
motion, 399 
sparse, 366, 379, 747 
Joint bilateral filter, 496 
Joint domain (feature space), 294 

K-d trees, 233 
K-means, 289 
Kalman snakes, 276 

Kanade-Lucas-Tomasi (KLT) tracker, 235 
Karhunen-Loève transform, 143, 671
Kernel, 117 

bilinear, 117 
Gaussian, 117 
low-pass, 117 
Sobel operator, 118 
unsharp mask, 117 
Kernel basis function, 176 
Kernel density estimation, 292 
Keypoint detection, see Feature detection 
Kinematic model (chain), 607 
Kruppa equations, 356 

L*a*b*, see Color 
L*u*v*, see Color 
L1 norm, 179, 385, 411, 597
L∞ norm, 367
Lambertian reflection, 63 
Laplacian matting, 515
Laplacian of Gaussian (LoG) filter, 119 
Laplacian pyramid, 151 
blending, 160, 200, 459 
perfect reconstruction, 151 
Latent Dirichlet process (LDP), 713 
Layered depth image (LDI), 626 
Layered depth panorama, 634 
Layered motion estimation, 415 
transparent, 419 
Layers 

image-based rendering, 626 
Layout consistent random field, 708 
Learning in computer vision, 714
Least median of squares (LMS), 318 
Least squares 

iterative solvers, 324, 748 
linear, 94, 312, 320, 384, 738, 742, 756, 
760, 786 

non-linear, 315, 324, 347, 746, 760, 787 
robust, see Robust least squares 
sparse, 364, 748, 787 
total, 744 

weighted, 313, 494, 498, 505 

Lens 

compound, 71 
nodal point, 71
thin, 69 

Lens distortions, 58 
calibration, 334 
decentering, 59 
radial, 58 
spline-based, 59
tangential, 59 
Lens law, 69 

Level of detail (LOD), 594 
Level sets, 281, 282 

fast marching method, 282 


geodesic active contour, 282 
Levenberg-Marquardt, 316, 371, 379, 747, 
783 

Lifting, see Wavelets 
Light field 

higher dimensional, 636 
light slab, 629 
ray space, 631
rendering, 628 
surface, 632 
Lightness, 84 
Line at infinity, 32 
Line detection, 250 

Hough transform, 251, 264 
RANSAC, 254 
simplification, 250, 264 
successive approximation, 251, 264 
Line equation, 32, 34 
Line fitting, 94, 264 
uncertainty, 265 
Line hull, see Visual hull 
Line labeling, 11
Line process, 194, 553, 764 
Line spread function (LSF), 476 
Line-based structure from motion, 374 
Linear algebra, 735 
least squares, 742 
matrix decompositions, 736 
references, 736 
Linear blend, 104 

Linear discriminant analysis (LDA), 676 
Linear filtering, 111 
Linear operator, 104 
superposition, 104 

Linear shift invariant (LSI) filter, 112
Live-wire, 280 
Local distance functions, 679 
Local operator, 111
Locality sensitive hashing (LSH), 233
Locally adaptive histogram equalization, 109 
Location recognition, 693 
Loopy belief propagation (LBP), 185, 769 
Low-pass filter, 117 
sinc, 117
Lumigraph, 628 

unstructured, 632 
Luminance, 82 
Lumisphere, 633 

M-estimator, 318, 384, 761 
Mahalanobis distance, 291, 673, 677, 758 
Manifold mosaic, 455, 649 
Markov chain Monte Carlo (MCMC), 760, 
765 

Markov random field, 180, 763 
cliques, 181, 764 
directed edges, 302 
dynamic, 772 
flux, 302 

inference, see MRF inference 
layout consistent, 708 
learning parameters, 180 
line process, 194, 553, 764 
neighborhood, 181, 763 
order, 181, 765 
random walker, 303 
stereo matching, 553 
Marr's framework, 13

computational theory, 13 
hardware implementation, 13 
representations and algorithms, 13 
Match move, 368 
Matrix decompositions, 736 
Cholesky, 741 
eigenvalue (ED), 737 
QR, 740 


singular value (SVD), 736 
square root, 741 
Matte reflection, 63 
Matting, 105, 106, 505, 529 
alpha matte, 105 
Bayesian, 510 
blue screen, 106, 195, 507 
difference, 106, 195, 508, 606 
flash, 517 
GrabCut, 513 
Laplacian, 514 
natural, 509 

optimization-based, 513 
Poisson, 513 
shadow, 517 
smoke, 516 
triangulation, 507, 518 
trimap, 509 
two screen, 507 
video, 518 

Maximally stable extremal region (MSER), 
220 

Maximum a posteriori (MAP) estimate, 180, 
763 

Mean absolute difference (MAD), 547 
Mean average precision, 229 
Mean shift, 289, 292 

bandwidth selection, 295 
Mean square error (MSE), 92, 547 
Measurement equation (model), 346, 757 
Measurement matrix, 359 
Measurement model, see Bayesian model 
Medial axis transform (MAT), 130 
Median absolute deviation (MAD), 385 
Median filter, 124 
weighted, 124 

Medical image registration, 408 
Medical image segmentation, 304 
Membrane, 175
Mesh-based warping, 170, 201 
Metamer, 82 
Metric learning, 679 
Metric tree, 234 
MIP-mapping, 167 
trilinear, 168 

Mixture of Gaussians, 272, 279, 289 
color model, 509 

expectation maximization (EM), 291 
mixing coefficient, 291
soft assignment, 291 
Model selection, 430, 763 
Model-based reconstruction, 598 
architecture, 598 
heads and faces, 601 
human body, 605 
Model-based stereo, 599, 624 
Models 

Bayesian, 180, 762 
forward, 3 
physically based, 14 
physics-based, 3 
probabilistic, 3 
Modular eigenspace, 678 
Modulation transfer function (MTF), 79, 476 
Morphable model 
body, 609 
face, 603, 639 
multidimensional, 639 
Morphing, 173, 202, 622, 623 
3D body, 609 
3D face, 603 
automated, 424 
facial feature, 639 
feature-based, 173, 202 
flow-based, 424 
video textures, 642 


view morphing, 623, 650 
Morphological operator, 127 
closing, 128 
dilation, 128 
erosion, 128 
opening, 128 
Morphology, 127 
Mosaic, see Image stitching 
Mosaics 

motion models, 430 
video compression, 436 
whiteboard and document scanning, 432 
Motion compensated video compression, 
387, 421 

Motion compensation, 92 
Motion estimation, 383 
affine, 398 

aperture problem, 394 
compositional, 400 
Fourier-based, 388 
frame interpolation, 418 
hierarchical, 387 
incremental refinement, 392 
layered, 415 
learning, 403, 411 
linear appearance variation, 397 
optical flow, 409 
parametric, 398 
patch-based, 384, 399 
phase correlation, 390 
quadtree spline-based, 407 
reflections, 419 
spline-based, 404 
translational, 384 
transparent, 419 
uncertainty modeling, 395 
Motion field, 398 
Motion models 
learned, 403

Motion segmentation, 425 
Motion stereo, 561 
Motion-based user interaction, 425 
Moving least squares (MLS), 596 
MRF inference, 182, 765 

alpha expansion, 185, 772 
belief propagation, 185, 768 
dynamic programming, 766 
expansion move, 185, 772 
gradient descent, 765 
graph cuts, 183, 770 
highest confidence first, 182 
highest confidence first (HCF), 765 
iterated conditional modes, 182, 765 
linear programming (LP), 773 
loopy belief propagation, 185, 769 
Markov chain Monte Carlo, 765 
simulated annealing, 182, 766 
stochastic gradient descent, 182, 765 
swap move (alpha-beta), 185, 772 
Swendsen-Wang, 766 
Multi-frame motion estimation, 413 
Multi-pass transforms, 169 
Multi-perspective panoramas, 437 
Multi-perspective plane sweep (MPPS), 445 
Multi-view stereo, 558 

epipolar plane image, 559 
evaluation, 567 

initialization requirements, 566 
reconstruction algorithm, 565 
scene representation, 563 
shape priors, 565 
silhouettes, 567 
space carving, 566 

spatio-temporally shiftable window, 560 
taxonomy, 563 
visibility, 565 


volumetric, 562 
voxel coloring, 566 
Multigrid, 753 

algebraic (AMG), 288, 753 
Multiple hypothesis tracking, 279 
Multiple-center-of-projection images, 437, 
649 

Multiresolution representation, 150 
Mutual information, 386, 408 

Natural image matting, 509 
Nearest neighbor 

distance ratio (NNDR), 230 
matching, see Feature matching 
Negative posterior log likelihood, 180, 758, 
762 

Neighborhood operator, 111, 122
Neural networks, 661 
Nintendo Wii, 326 
Nodal point, 71
Noise 

sensor, 76, 473 

Noise level function (NLF), 76, 96, 473, 527 
Noise removal, 144, 197, 203 
Non-linear filter, 122, 193 
Non-linear least squares 
see Least squares, 315

Non-maximal suppression, see Feature detection

Non-parametric density modeling, 292 
Non-photorealistic rendering (NPR), 522 
Non-rigid motion, 377 
Normal equations, 313, 393, 743, 746 
Normal map (geometry image), 594 
Normal vector, 34 

Normalized cross-correlation (NCC), 386, 
422, 547 

Normalized cuts, 296 




intervening contour, 298 
Normalized device coordinates (NDC), 49, 
54 

Normalized sum of squared differences 
(NSSD), 387 
Norms 

L1, 179, 385, 411, 597
L∞, 367

Nyquist rate / frequency, 78 

Object detection, 658 
car, 666, 722 
face, 658 
part-based, 667 
pedestrian, 666, 684 
Object-centered projection, 57 
Occluding contours, 543 
Octree reconstruction, 569 
Octree spline, 409 

Omnidirectional vision systems, 646 

Opacity, 106 

Operator 

linearity, 104 

Optic flow, see Optical flow 
Optical center, 52 
Optical flow, 409 

anisotropic smoothness, 411 
evaluation, 413 
fusion move, 413 
global and local, 410 
Markov random field, 411 
multi-frame, 413 
normal flow, 394 
patch-based, 409 
region-based, 417 
regularization, 410 
robust regularization, 411
smoothness, 410 


total variation, 411 
Optical flow constraint equation, 393 
Optical illusions, 3 

Optical transfer function (OTF), 79, 476 
Optical triangulation, 586 
Optics, 68 

chromatic aberration, 71
Seidel aberrations, 70 
vignetting, 72, 527 
Optimal motion estimation, 363 
Oriented particles (points), 595 
Orthogonal Procrustes, 320 
Orthographic projection, 46 
Osculating circle, 544 
Over operator, 106 
Overview, 19 

Padding, 114, 196 
Panography, 314, 337 
Panorama, see Image stitching 
Panorama with depth, 438, 542, 634 
Para-perspective projection, 48 
Parallel tracking and mapping (PTAM), 369 
Parameter sensitive hashing, 233 
Parametric motion estimation, 398 
Parametric surface, 593 
Parametric transformation, 163, 201 
Parseval’s Theorem, see Fourier transform 
Part-based recognition, 701 
constellation model, 704 
Particle filtering, 279, 608, 760 
Parzen window, 292 

PASCAL Visual Object Classes Challenge 
(VOC), 718 

Patch-based motion estimation, 384 
Peak signal-to-noise Ratio (PSNR), 92, 144 
Pedestrian detection, 666 
Penumbra, 60 
Performance-driven animation, 237, 605, 639
Perspective n-point problem (PnP), 322 
Perspective projection, 48 
Perspective transform (2D), 37 
Phase correlation, 390, 422 
Phong shading, 65 
Photo pop-up, 710 
Photo Tourism, 624 
Photo-mosaic, 429 
Photoconsistency, 540, 565 
Photometric image formation, 60 
calibration, 470 
global illumination, 67 
lighting, 60 
optics, 68 
radiosity, 67 
reflectance, 62 
shading, 65 

Photometric stereo, 582 
Photometry, 60 
Photomontage, 459 
Physically based models, 14 
Physics-based vision, 16 
Pictorial structures, 12, 19, 701 
Pixel transform, 103 
Plücker coordinates, 35
Planar pattern tracking, 326 
Plane at infinity, 34 
Plane equation, 33 

Plane plus parallax, 55, 405, 417, 540, 626 

Plane sweep, 540, 572 

Plane-based structure from motion, 376 

Plenoptic function, 628 

Plenoptic modeling, 623 

Plumb-line calibration method, 335, 341 

Point distribution model, 275 

Point operator, 101 

Point process, 101 


Point spread function (PSF), 78 
estimation, 476, 528 
Point-based representations, 595 
Points at infinity, 32 
Poisson 

blending, 460 
equations, 597 
matting, 513 
noise, 76 

surface reconstruction, 597 
Polar coordinates, 33 
Polar projection, 59, 440 
Polyphase filter, 145 
Pop-out effect, 4 
Pose estimation, 321 
iterative, 324 
Power spectrum, 140 
Precision, see Error rates 
mean average, 229 
Preconditioning, 751 

Principal component analysis (PCA), 275, 
660, 671, 738, 758 
face modeling, 601 
generalized, 740 
missing data, 360, 740 
Prior energy (term), 181, 763 
Prior model, see Bayesian model 
Profile curves, 543 
Progressive mesh (PM), 594 
Projections 

object-centered, 57 
orthographic, 46 
para-perspective, 48 
perspective, 48 

Projective (uncalibrated) reconstruction, 353 
Projective depth, 55, 540 
Projective disparity, 55, 540 
Projective space, 32 
PROSAC (PROgressive SAmple Consensus),
319 

PSNR, see Peak signal-to-noise ratio 
Pyramid, 144, 200 

blending, 160, 200 
Gaussian, 150 
half-octave, 152 
Laplacian, 151 
motion estimation, 387 
octave, 150 

radial frequency implementation, 159 
steerable, 159 
Pyramid match kernel, 698 

QR factorization, 740 
Quadratic form, 177 
Quadrature mirror filter (QMF), 150 
Quadric equation, 33, 35 
Quadtree spline 

motion estimation, 407 
restricted, 407 
Quaternions, 43 
antipodal, 43 
multiplication, 44 

Query by image content (QBIC), 717 
Query expansion, 692 
Quincunx sampling, 152 

Radial basis function, 171, 176, 592 
Radial distortion, 58 
barrel, 58 
calibration, 334 
parameters, 58 
pincushion, 58 
Radiance map, 483 
Radiometric image formation, 60 
Radiometric response function, 470 
Radiometry, 60 
Radiosity, 68 


Random walker, 303, 771 
Range (of a function), 103 
Range data, see Range scan 
Range image, see Range scan 
Range scan 

alignment, 588, 617 
large scenes, 590 
merging, 589 
registration, 588, 617 
segmentation, 588 
volumetric, 590 

Range sensing (rangefinding), 585 
coded pattern, 587 
light stripe, 586 
shadow stripe, 586, 616 
spacetime stereo, 587 
stereo, 587 

texture pattern (checkerboard), 587 
time of flight, 587 
RANSAC 

(RAndom SAmple Consensus), 318
inliers, 319 
preemptive, 319 
progressive (PROSAC), 319 
RAW image format, 77 
Ray space (light field), 631
Ray tracing, 68 
Rayleigh quotient, 297 
Recall, see Error rates 
Receiver Operating Characteristic 
area under the curve (AUC), 229 
mean average precision, 229 
ROC curve, 229, 260 
Recognition, 655 
3D models, 725 
category (class), 696 
color similarity, 717 
context, 712 




contour-based, 724 
data sets, 718 
face, 668 
instance, 685 
large scale, 715 
learning, 714 
part-based, 701 
scene understanding, 712 
segmentation, 704 
shape context, 724 
Rectangle detection, 257 
Rectification, 538, 571 

standard rectified geometry, 539 
Recursive filter, 122 
Reference plane, 55 
Reflectance, 62 
Reflectance map, 580 
Reflectance modeling, 611 
Reflection 

di-chromatic, 67 
diffuse, 63 
specular, 64 
Region 

merging, 286 
splitting, 286 

Region segmentation, see Segmentation 
Registration, see Image Alignment 
feature-based, 311 
intensity-based, 384 
medical image, 408 
Regularization, 174, 407 
robust, 178 

Regularization parameter, 176 
Residual error, 312, 318, 346, 363, 384, 393, 
399, 410, 411, 742, 750
RGB (red green blue), see Color 
Rigid body transformation, 36, 40 
Robust error metric, see Robust penalty function

Robust least squares, 255, 256, 318, 384, 761 
iteratively reweighted, 318, 324, 398, 
761 

Robust penalty function, 178, 384, 397, 499, 
542, 547, 548, 553, 761 
Robust regularization, 178 
Robust statistics, 385, 760 
inliers, 319 

M-estimator, 318, 384, 761 
Rodriguez’s formula, 42 
Root mean square error (RMS), 92, 385 
Rotations, 41 

Euler angles, 41 
axis/angle, 41 
exponential twist, 43 
incremental, 45 
interpolation, 45 
quaternions, 43 
Rodriguez’s formula, 42 

Sampling, 77 

Scale invariant feature transform (SIFT), 223 
Scale-space, 13, 119, 152, 282 
Scatter matrix, 671 

between-class, 675 
within-class, 674 

Scattered data interpolation, 171, 592 
Scene completion, 709 
Scene flow, 562, 644 
Scene understanding, 712 
gist, 709, 714 
scene alignment, 715 
Schur complement, 366, 748 
Scratch removal, 521 
Seam selection 

image stitching, 456 
Second-order cone programming (SOCP), 367

Seed and grow 
stereo, 543 

structure from motion, 371

Segmentation, 267 

active contours, 270 
affinities, 296 
binary MRF, 182, 300 
CONDENSATION, 279 
connected components, 131, 198 
energy-based, 300 
for recognition, 704 
geodesic active contour, 282 
geodesic distance, 304 
GrabCut, 301, 513
graph cuts, 300 
graph-based, 286 
hierarchical, 285, 288 
intelligent scissors, 280 
joint feature space, 294 
k-means, 289 
level sets, 281 
mean shift, 289, 292 
medical image, 304 
merging, 286 

minimum description length (MDL), 
300 

mixture of Gaussians, 289 
Mumford-Shah, 300 
non-parametric, 292 
normalized cuts, 296 
probabilistic aggregation, 288 
random walker, 303 
snakes, 270 
splitting, 286 
stereo matching, 556 
thresholding, 127 


tobogganing, 281, 285 
watershed, 284 

weighted aggregation (SWA), 300 
Seidel aberrations, 70 
Self-calibration, 355 

bundle adjustment, 357 
Kruppa equations, 356 
Sensing, 73 

aliasing, 77, 476 
color, 80 
color balance, 86 
gamma, 87 
pipeline, 74, 471 
sampling, 77 
sampling pitch, 75 
Sensor noise, 76, 473 
amplifier, 76 
dark current, 76 
fixed pattern, 76 
shot noise, 76 

Separable filtering, 115, 197 
Shading, 65 

equation, 64 
shape-from, 580 
Shadow matting, 517 
Shape context, 249, 724 
Shape from 

focus, 584, 616 
photometric stereo, 582 
profiles, 543 
shading, 580 
silhouettes, 567 
specularities, 584 
stereo, 533 
texture, 583 

Shape parameters, 275, 681 
Shape-from-X, 14 
focus, 14 
photometric stereo, 14
shading, 14 
texture, 14 
Shift invariance, 112 
Shiftable multi-scale transform, 159 
Shutter speed, 75 

Signed distance function, 281, 589, 595, 597 
Silhouette-based reconstruction, 567 
octree, 569 
visual hull, 567 
Similarity transform, 36, 40 
Simulated annealing, 182, 766 
Simultaneous localization and mapping 
(SLAM), 368 

Sinc filter

interpolation, 148 
low-pass, 117 
windowed, 148 

Single view metrology, 331, 340 
Singular value decomposition (SVD), 736 
Skeletal set, 367, 372 
Skeleton, 130, 248 
Skew, 50, 52 
Skin color detection, 96 
Slant edge calibration, 476 
Slippery spring, 273 
Smoke matting, 516 
Smoothness constraint, 176 
Smoothness penalty, 176 
Snakes, 270 

ballooning, 271 
dynamic, 276 
internal energy, 270 
Kalman, 276 
shape priors, 274 
slippery spring, 273 
Soft assignment, 291 
Software, 780 


Space carving 

multi-view stereo, 566 
Spacetime stereo, 587 
Sparse flexible model, 703 
Sparse matrices, 747, 787 

compressed sparse row (CSR), 747 
skyline storage, 747 
Sparse methods 

direct, 747, 787 
iterative, 748, 787 
Spatial pyramid matching, 699 
Spectral response function, 85 
Spectral sensitivity, 85 
Specular flow, 584 
Specular reflection, 64 
Spherical coordinates, 34, 253, 256, 439 
Spherical linear interpolation, 45 
Spin image, 589 
Splatting, see Forward warping 
volumetric, 595 
Spline 

controlled continuity, 175 
octree, 409 
quadtree, 407 
thin plate, 175 

Spline-based motion estimation, 404 
Splining images, see Laplacian pyramid 
blending 

Sprites 

image-based rendering, 626 
motion estimation, 415 
video, 642 

video compression, 436 
with depth, 627 

Statistical decision theory, 757, 760 
Steerable filter, 119, 198 
Steerable pyramid, 159 
Steerable random field, 184 
Stereo, 533

aggregation methods, 549, 573 

coarse-to-fine, 554 

cooperative algorithms, 554 

correspondence, 535 

curve-based, 543 

dense correspondence, 545 

depth map, 535 

dynamic programming, 554 

edge-based, 543 

epipolar geometry, 537 

feature-based, 543 

global optimization, 552, 573 

graph cut, 553 

layers, 558 

local methods, 548 

model-based, 599, 624 

multi-view, 558

non-parametric similarity measures, 547 
photoconsistency, 540 
plane sweep, 540, 572 
rectification, 538, 571 
region-based, 548 
scanline optimization, 556 
seed and grow, 543 
segmentation-based, 548, 556 
semi-global optimization, 556 
shiftable window, 560 
similarity measure, 546 
spacetime, 587 
sparse correspondence, 543 
sub-pixel refinement, 550 
support region, 548 
taxonomy, 535, 545 
uncertainty, 551
window-based, 548, 573 
winner-take-all (WTA), 550 
Stereo-based head tracking, 551 


Stiffness matrix, 177 
Stitching, see Image stitching 
Stochastic gradient descent, 182 
Structural Similarity (SSIM) index, 144 
Structure from motion, 345 
affine, 359 

bas-relief ambiguity, 370 
bundle adjustment, 363 
constrained, 374 
factorization, 357 
feature tracks, 371
iterative factorization, 360 
line-based, 374 
multi-frame, 357 
non-rigid, 377 
orthographic, 357 
plane-based, 362, 376 
projective factorization, 360 
seed and grow, 371 
self-calibration, 355 
skeletal set, 367, 372 
two-frame, 347 
uncertainty, 370 
Subdivision surface, 593 

subdivision connectivity, 594 
Subspace learning, 679 
Sum of absolute differences (SAD), 384, 422, 
547 

Sum of squared differences (SSD), 384, 422, 
547 

bias and gain, 386 
Fourier-based computation, 389 
normalized, 387 
surface, 210, 396 
weighted, 385 
windowed, 385 

Sum of sum of squared differences (SSSD), 
559 


Summed area table, 120 
Super-resolution, 497, 529 
example-based, 499 
faces, 501 
hallucination, 499 
prior, 499 

Superposition principle, 104 
Superquadric, 597 

Support vector machine (SVM), 662, 667 
Surface element (surfel), 595 
Surface interpolation, 592 
Surface light field, 632 
Surface representations, 591 
non-parametric, 593 
parametric, 593 
point-based, 595 
simplification, 594 
splines, 593 

subdivision surface, 593 
symmetry-seeking, 593 
triangle mesh, 593 
Surface simplification, 594 
Swendsen-Wang algorithm, 766 

Telecentric lens, 48, 585 
Temporal derivative, 393, 410 
Temporal texture, 642 

Term frequency-inverse document frequency 
(TF-IDF), 689
Testing algorithms, viii 
TextonBoost, 706 
Texture 

shape-from, 583 
Texture addressing mode, 115 
Texture map 

recovery, 610 
view-dependent, 611, 623 
Texture mapping 


anisotropic filtering, 168 
MIP-mapping, 167 
multi-pass, 169 
trilinear interpolation, 168 
Texture synthesis, 518, 531 
by numbers, 523 
hole filling, 521 
image quilting, 519 
non-parametric, 519 
transfer, 522 
Thin lens, 69 
Thin-plate spline, 175 
Thresholding, 127 

Through-the-lens camera control, 326, 368 
Tobogganing, 281, 285 
Tonal adjustment, 111, 196 
Tone mapping, 487 
adaptive, 488 
bilateral filter, 489 
global, 487 
gradient domain, 489 
halos, 489 
interactive, 493 
local, 488 
scale selection, 492 

Total least squares (TLS), 265, 398, 744 
Total variation, 179, 411, 597 
Tracking 

feature, 235 
head, 551 

human motion, 605 
multiple hypothesis, 279 
planar pattern, 326 
PTAM, 369 

Translational motion estimation, 384 
bias and gain, 386 
Transparency, 106 

Travelling salesman problem (TSP), 272 
Tri-chromatic sensing, 81
Tri-stimulus values, 81, 85 
Triangulation, 345 

Trilinear interpolation, see MIP-mapping 
Trimap (matting), 509 
Trust region method, 747 
Two-dimensional Fourier transform, 140 

Uncanny valley, 3 
Uncertainty 

correspondence, 313 
modeling, 319, 775 
weighting, 313 
Unsharp mask, 117 
Upsampling, see Interpolation 

Vanishing point 

detection, 254, 266 
Hough, 255 
least squares, 256 
modeling, 599 
uncertainty, 266 
Variable reordering, 748 
minimum degree, 748 
multi-frontal, 748 
nested dissection, 748 
Variable state dimension filter (VSDF), 367 
Variational method, 175 
Video compression 

motion compensated, 387 
Video compression (coding), 421 
Video denoising, 414 
Video matting, 518 
Video objects (coding), 415 
Video sprites, 642 
Video stabilization, 401, 423 
Video texture, 640 
Video-based animation, 639 
Video-based rendering, 638 


3D video, 643 
animating pictures, 643 
sprites, 642 
video texture, 640 
virtual viewpoint video, 644 
walkthroughs, 645 
VideoMouse, 326 
View correlation, 368 
View interpolation, 357, 621, 650 
View morphing, 357, 623, 642 
View-based eigenspace, 678 
View-dependent texture maps, 623 
Vignetting, 72, 386, 474, 527 
mechanical, 73 
natural, 72 

Virtual viewpoint video, 644 
Visual hull, 567 

image-based, 569 
Visual illusions, 3 
Visual odometry, 368 
Visual words, 234, 688, 697 
Vocabulary tree, 234, 691 
Volumetric 3D reconstruction, 562 
Volumetric range image processing (VRIP), 
589 

Volumetric representations, 596 
Voronoi diagram, 455 
Voxel coloring 

multi-view stereo, 566 

Watershed, 284, 292 
basins, 284, 292 
oriented, 285 
Wavelets, 154, 201 
compression, 201 
lifting, 156 

overcomplete, 155, 159 
second generation, 158 


self-inverting, 159
tight frame, 155 
weighted, 158 
Weaving wall, 544 

Weighted least squares (WLS), 493, 505 

Weighted prediction (bias and gain), 386 

White balance, 86, 97 

Whitening transform, 673 

Wiener filter, 140, 142, 198 

Wire removal, 521 

Wrapping mode, 115

XYZ, see Color 

Zippering, 589 


